Abstract: Given a dictionary of $M_n$ initial estimates of the unknown true regression function,
we aim to construct linearly aggregated estimators that target the best performance
among all the linear combinations under a sparse $q$-norm ($0 \le q \le 1$) constraint on the linear
coefficients. Besides identifying the optimal rates of aggregation for these $\ell_q$-aggregation
problems, our multi-directional (or universal) aggregation strategies by model mixing or model
selection achieve the optimal rates simultaneously over the full range of $0 \le q \le 1$ for general
$M_n$ and upper bound $t_n$ of the $q$-norm. Both random and fixed designs, with known or
unknown error variance, are handled, and the $\ell_q$-aggregations examined in this work cover
major types of aggregation problems previously studied in the literature. Consequences on
minimax-rate adaptive regression under $\ell_q$-constrained true coefficients ($0 \le q \le 1$) are also
provided.

Our results show that the minimax rate of $\ell_q$-aggregation ($0 \le q \le 1$) is basically determined
by an effective model size, which is a sparsity index that depends on $q$, $t_n$, $M_n$, and
the sample size n in an easily interpretable way based on a classical model selection theory 
that deals with a large number of models. In addition, in the fixed design case, the model 
selection approach is seen to yield optimal rates of convergence not only in expectation but 
also with exponential decay of deviation probability. In contrast, the model mixing approach 
can have leading constant one in front of the target risk in the oracle inequality while not 
offering optimality in deviation probability. 

Keywords and phrases: minimax risk, adaptive estimation, sparse $\ell_q$-constraint, linear
combining, aggregation, model mixing, model selection.

1. Introduction

The idea of sharing strengths of different estimation procedures by combining them instead of 
choosing a single one has led to fruitful and exciting research results in statistics and machine 
learning. In statistics, the theoretical advances have centered on optimal risk bounds that require 
almost no assumption on the behaviors of the individual estimators to be integrated (see, e.g., 
[64, 67, 22, 24, 42, 52, 69, 58] for early representative work). While there are many different ways that 
one can envision to combine the advantages of the candidate procedures, the combining methods 
can be put into two main categories: those intended for combining for adaptation, which aims at
combining the procedures to perform adaptively as well as the best candidate procedure no matter 
what the truth is, and those for combining for improvement, which aims at improving over the 
performance of all the candidate procedures in certain ways. Whatever the goal is, for the purpose 
of estimating a target function (e.g., the true regression function), we expect to pay a price: the risk 
of the combined procedure is typically larger than the target risk. The difference between the two 
risks (or a proper upper bound on the difference) is henceforth called risk regret of the combining 
method. 

Research attention is often focused on one step, albeit the main one, in the process of combining
procedures, namely, aggregation of estimates, wherein one has already obtained estimates by all the
candidate procedures (based on initial data, most likely from data splitting, or previous studies), 
and is trying to aggregate these estimates into a single one based on data that are independent of the 
initial data. The performance of the aggregated estimator (conditional on the initial estimates) plays 
the most important role in determining the total risk of the whole combined procedure, although the 
proportion of the initial data size and the later one certainly also influences the overall performance. 
In this work, we will mainly focus on the aggregation step. 

It is now well-understood that given a collection of procedures, although combining procedures 
for adaptation and selecting the best one share the same goal of achieving the best performance 
offered by the candidate procedures, the former usually wins when model selection uncertainty is 
high (see, e.g., [74]). Theoretically, one only needs to pay a relatively small price for aggregation for
adaptation ([66, 24, 58]). In contrast, aggregation for improvement under a convex constraint or
$\ell_1$-constraint on coefficients is associated with a higher risk regret (as shown in [42, 52, 69, 58]). Several
other directions of aggregation for improvement, defined via proper constraints imposed on the
$\ell_0$-norm alone or in conjunction with the $\ell_1$-norm of the linear coefficients, have also been studied,
including linear aggregation (no constraint, [58]), aggregation to achieve the best performance of
a linear combination of no more than a given number of initial estimates ([19]) and also under
an additional constraint on the $\ell_1$-norm of these coefficients ([49]). Interestingly, combining for
adaptation has a fundamental role for combining for improvement: it serves as an effective tool in 
constructing multi-directional (or universal) aggregation methods that simultaneously achieve the 
best performance in multiple specific directions of aggregation for improvement. This strategy was 
taken in section 3 of [69], where aggregations of subsets of estimates are then aggregated to be 
suitably aggressive and conservative in an adaptive way. Other uses of subset models for universal 
aggregation have been handled in [19, 54]. 

The goal of this paper is to propose aggregation methods that achieve the performance (in risk
with/without a multiplying factor), up to a multiple of the optimal risk regret as defined in [58], of
the best linear combination of the initial estimates under the constraint that the $q$-norm ($0 \le q \le 1$)
of the linear coefficients is no larger than some positive number $t_n$ (henceforth the $\ell_q$-constraint). We
call this type of aggregation $\ell_q$-aggregation. It turns out that the optimal rate is simply determined
by an effective model size $m_*$, which roughly means that only $m_*$ terms are really needed for effective
estimation. We strive to achieve the optimal $\ell_q$-aggregation simultaneously for all $q$ ($0 \le q \le 1$) and
$t_n$ ($t_n > 0$). From the work in [42, 69, 58, 4], it is known that by suitable aggregation methods, the
squared $L_2$ risk is no larger than that of the best linear combination of the initial $M_n$ estimates with
the $\ell_1$-norm of the coefficients bounded by 1 plus the order $(\log(M_n/\sqrt{n})/n)^{1/2}$ when $M_n \ge \sqrt{n}$ or
$M_n/n$ when $M_n < \sqrt{n}$. Two important features are evident here: 1) When $M_n$ is large, its effect
on the risk enlargement is only logarithmic; 2) No assumption is needed at all on
how the initial estimates are possibly correlated. The strong result comes from the $\ell_1$-constraint on
the coefficients.

Indeed, in the last decade of the twentieth century, the fact that $\ell_1$-type constraints induce
sparsity has been used in different ways for statistical estimation to attain relatively fast rates of
convergence as a means to overcome the curse of dimensionality. Among the most relevant ones,
Barron [9] studied the use of the $\ell_1$-constraint in construction of estimators for fast convergence with
neural nets; Tibshirani [57] introduced the Lasso; Chen, Donoho and Saunders [25] proposed
basis pursuit with overcomplete bases. Theoretical advantages have also been pointed out. Barron
[8] showed that for estimating a high-dimensional function that has integrable Fourier transform
or a neural net representation, accurate approximation error is achievable. Together with model
selection over finite dimensional neural network models, relatively fast rates of convergence, e.g.,
$[(d \log n)/n]^{1/2}$, where $d$ is the input dimension, are obtained (see, e.g., [9] with parameter
discretization, section III.B in [71] and section 4.2 in [11] with continuous models). Donoho and Johnstone
[30] identified how the $\ell_q$-constraint ($q > 0$) on the mean vector affects estimation accuracy under
$\ell_p$ loss ($p \ge 1$) in an illustrative Gaussian sequence model. For function estimation, Donoho [28]
studied sparse estimation with unconditional orthonormal bases and related the essential rate of
convergence to a sparsity index. In that direction, for a special case of function classes with
unconditional basis defined basically in terms of bounded $q$-norm on the coefficients of the orthonormal
expansion, the rate of convergence $(\log n/n)^{1-q/2}$ was given in [71] (section 5). The same rate also
appeared in the earlier work of Donoho and Johnstone [30] in some asymptotic settings. Note that
when $q = 1$, this is exactly the same rate of the risk regret for $\ell_1$-aggregation when $M_n$ is of order
$n^{\kappa}$ for $1/2 < \kappa < \infty$.

General model selection theories on function estimation intend to work with general and possibly
complicatedly dependent terms. Considerable research has been built upon subset selection as a
natural way to pursue sparse and flexible estimation. When exponentially many or more models
are entertained, optimality theories that handle a small number of models (e.g., [56, 48]) are no
longer suitable. General theories were then developed for estimators based on criteria that add an
additional penalty to the AIC type criteria, where the additional penalty term prevents substantial
overfitting that often occurs when working with exponentially many models by standard information
criteria, such as AIC and BIC. A masterpiece of work with tremendous breadth and depth is Barron,
Birgé and Massart [11], and some other general results in specific contexts of density estimation
and regression with fixed or random design are in [71, 65, 18, 5, 6, 15].

These model selection theories are stated for nonparametric scenarios where none of the finite- 
dimensional approximating models is assumed to hold but they are used as suitable sieves to deliver 
good estimators when the size of the sieve is properly chosen (see, e.g., [55, 59, 17] for non-adaptive 
sieve theories). If one makes the assumption that a subset model of at most $k_n$ terms holds
($\ell_0$-constraint), then the general risk bounds mentioned in the previous paragraph immediately give the
order $k_n \log(M_n/k_n)/n$ for the risk of estimating the target function under quadratic type losses.

Thus, the literature shows that both $\ell_0$- and $\ell_1$-constraints result in fast rates of convergence
(provided that $M_n$ is not too large and $k_n$ is relatively small), with hard-sparsity coming directly
from the fact that only a small number of terms is involved in the true model under the $\ell_0$-constraint,
and soft-sparsity originating from the fact that there can only be a few large coefficients under
the $\ell_1$-constraint. In this work, with new approximation error bounds in $\ell_{q,t_n}$-hulls (defined in
section 2.1) for $0 \le q \le 1$, from a theoretical standpoint, we will see that model selection or model
combining with all subset models in fact simultaneously exploits the advantage of sparsity induced
by $\ell_q$-constraints for $0 \le q \le 1$ to the maximum extent possible.

Clearly, all subset selection is computationally infeasible when the number of terms $M_n$ is large.
To overcome this difficulty, an interesting research direction is based on greedy approximation, where
terms are added one after another sequentially (see, e.g., [12]). Some general theoretical results are
given in the recent work of [40], where a theory on function estimation via penalized squared error
criteria is established and is applicable to several greedy algorithms. The associated risk bounds 
yield optimal rate of convergence for sparse estimation scenarios. For aggregation methods based 
on exponential weighting under fixed design, practical algorithms based on Monte Carlo methods 
have been given in [27, 54]. 

Considerable recent research has focused on $\ell_1$-regularization, producing efficient algorithms and
related theories. Interest lies both in the risk of regression estimation and in variable selection. Some
estimation risk bounds are in [13, 37, 43, 44, 50, 51, 62, 60, 76, 75, 77, 73].

The $\ell_q$-constraint, despite being non-convex for $0 < q < 1$, poses an easier optimization challenge
than the $\ell_0$-constraint, which is known to define an NP-hard optimization problem and to be hardly
tractable for large dimensions. Although a few studies have been devoted to the algorithmic development
of the $\ell_q$-constrained optimization problem, such as the multi-stage convex relaxation algorithm ([78])
and the DC programming approach ([33]), little work has been done with respect to the theoretical
analysis of the $\ell_q$-constrained framework.

Sparse model estimation by imposing the $\ell_q$-constraint has found consensus among academics and
practitioners in many application fields, among which, just to mention a few, are compressed sensing,
signal and image compression, gene expression, cryptography and recovery of lost data. The
$\ell_q$-constraints not only promote sparsity but also are often approximately satisfied on natural
classes of signals and images, such as the bounded variation model for images and the bump algebra
model for spectra ([29]).

Our $\ell_q$-aggregation risk upper bounds require no assumptions on dependence of the initial
estimates in the dictionary and the true regression function is arbitrary (except that it has a known
sup-norm upper bound in the random design case). The results readily give minimax rate optimal
estimators for a regression function that is representable as a linear combination of the predictors
subject to $\ell_q$-constraints on the linear coefficients.

Two recent and interesting results are closely related to our work, both under fixed design only.
Raskutti, Wainwright and Yu [53] derived in-probability minimax rates of convergence for estimating
the regression functions in $\ell_{q,t_n}$-hulls with minimal conditions for the full range of $0 \le q \le 1$. In
addition, in an informative contrast, they have also handled the quite different problem of estimating
the coefficients under necessarily much stronger conditions. Rigollet and Tsybakov [54] nicely showed
that exponential mixing of least squares estimators by an algorithm of Leung and Barron [46] over
subset models achieves universal aggregation of five different types of aggregation, which involve
$\ell_0$- and/or $\ell_1$-constraints. Furthermore, they implemented an MCMC based algorithm with favorable
numerical results. As will be seen, in this context of regression under fixed design, our theoretical
results are broader with improvements in several different ways.

Our theoretical work emphasizes adaptive minimax estimation under the mean squared risk. 



Z. Wang, S. Paterlini, F. Gao and Y. Yang / Adaptive Minimax Estimation over Sparse $\ell_q$-Hulls






Building upon effective estimators and powerful risk bounds for model selection or aggregation 
for adaptation, we propose several aggregation/combining strategies and derive the corresponding 
oracle inequalities or index of resolvability bounds. Upper bounds for $\ell_q$-aggregations and for linear
regression with $\ell_q$-constraints are then readily obtained by evaluating the index of resolvability
for the specific situations, incorporating an approximation error result that follows from a new
and precise metric entropy calculation on function classes of $\ell_{q,t_n}$-hulls. Minimax lower bounds
that match the upper rates are also provided in this work. Whatever the relationships between
the dictionary size $M_n$, the sample size $n$, and upper bounds on the $\ell_q$-constraints, our estimators
automatically take advantage of the best sparse $\ell_q$-representation of the regression function in a
proper sense. 

By using classical model selection theory, we have a simple explanation of the minimax rates,
by considering the effective model size $m_*$, which provides the best possible trade-off between the
approximation error, the estimation error, and the additional price due to searching over terms that
are not pre-ordered. The optimal rate of risk regret for $\ell_q$-aggregation, under either hard or soft sparsity
(or both together), can then be expressed in a unified way as

$$\mathrm{REG}(m_*) = 1 \wedge \frac{m_*\left(1 + \log\left(M_n/m_*\right)\right)}{n},$$

which can then be interpreted as the log number of models of size $m_*$ divided by the sample size
($\wedge 1$), as was previously suggested for the hard sparsity case $q = 0$ (e.g., Theorem 1 of [71], Theorems
1 and 4 of [65]).

The paper is organized as follows. In section 2, we introduce notation and some preliminaries
of the estimators and aggregation algorithms that will be used in our strategies. In addition, we
derive metric entropy and approximation error bounds for $\ell_{q,t_n}$-hulls that play an important role
in determining the minimax rate of convergence and adaptation. In section 3, we derive optimal
rates of $\ell_q$-aggregation and show that our methods achieve multi-directional aggregation. We also
briefly talk about $\ell_q$-combination of procedures. In section 4, we derive the minimax rate for linear
regression with $\ell_q$-constrained coefficients also under random design. In section 5, we handle
$\ell_q$-regression/aggregation under fixed design with known or unknown variance. A discussion is then
reported in section 6. In section 7, oracle inequalities are given for the random design. Proofs of
the results are provided in section 8. We note that some upper and lower bounds in the last two
sections may be of independent interest.

2. Preliminaries 

Consider the regression problem where a dictionary of $M_n$ prediction functions ($M_n \ge 2$ unless
stated otherwise) are given as initial estimates of the unknown true regression function. The goal is
to construct a linearly combined estimator using these estimates to pursue the performance of the 
best (possibly constrained) linear combinations. A learning strategy with two building blocks will 
be considered. First, we construct candidate estimators from subsets of the given estimates. Second, 
we aggregate the candidate estimators using aggregation algorithms or model selection methods to 
obtain the final estimator. 

2.1. Notation and definition 

Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be $n$ ($n \ge 2$) i.i.d. observations where $X_i = (X_{i,1}, \ldots, X_{i,d})$, $1 \le i \le n$,
take values in $\mathcal{X} \subset \mathbb{R}^d$ with a probability distribution $P_X$. We assume the regression model

$$Y_i = f_0(X_i) + \varepsilon_i, \quad i = 1, \ldots, n, \tag{2.1}$$

where $f_0$ is the unknown true regression function to be estimated. The random errors $\varepsilon_i$, $1 \le i \le n$,
are independent of each other and of $X_i$, and have the probability density function $h(x)$ (with
respect to the Lebesgue measure or a general measure $\mu$) such that $E(\varepsilon_i) = 0$ and $E(\varepsilon_i^2) = \sigma^2 < \infty$.
The quality of estimating $f_0$ by using the estimator $\hat f$ is measured by the squared $L_2$ risk (with
respect to $P_X$)

$$R(\hat f; f_0; n) = E\|\hat f - f_0\|^2 = E\left(\int (\hat f - f_0)^2 \, dP_X\right),$$

where, as in the rest of the paper, $\|\cdot\|$ denotes the $L_2$-norm with respect to the distribution $P_X$.
Let $F_n = \{f_1, f_2, \ldots, f_{M_n}\}$ be a dictionary of $M_n$ initial estimates of $f_0$. In this paper, unless
stated otherwise, $\|f_j\| \le 1$, $1 \le j \le M_n$. Consider the constrained linear combinations of the
estimates $\mathcal{F} = \{f_\theta = \sum_{j=1}^{M_n} \theta_j f_j : \theta \in \Theta_n, f_j \in F_n\}$, where $\Theta_n$ is a subset of $\mathbb{R}^{M_n}$. The problem of
constructing an estimator $\hat f$ that pursues the best performance in $\mathcal{F}$ is called aggregation of estimates.
We consider aggregation of estimates with sparsity constraints on $\theta$. For any $\theta = (\theta_1, \ldots, \theta_{M_n})'$,
define the $\ell_0$-norm and the $\ell_q$-norm ($0 < q \le 1$) by

$$\|\theta\|_0 = \sum_{j=1}^{M_n} I(\theta_j \ne 0), \quad \text{and} \quad \|\theta\|_q = \left(\sum_{j=1}^{M_n} |\theta_j|^q\right)^{1/q},$$

where $I(\cdot)$ is the indicator function. Note that for $0 < q < 1$, $\|\cdot\|_q$ is not a norm but a quasinorm,
and for $q = 0$, $\|\cdot\|_0$ is not even a quasinorm. But we choose to refer to them as norms for ease of
exposition. For any $0 \le q \le 1$ and $t_n > 0$, define the $\ell_q$-ball

$$B_q(t_n; M_n) = \{\theta = (\theta_1, \theta_2, \ldots, \theta_{M_n})' : \|\theta\|_q \le t_n\}.$$

When $q = 0$, $t_n$ is understood to be an integer between 1 and $M_n$, and sometimes denoted by
$k_n$ to be distinguished from $t_n$ when $q > 0$. Define the $\ell_{q,t_n}$-hull of $F_n$ to be the class of linear
combinations of functions in $F_n$ with the $\ell_q$-constraint

$$\mathcal{F}_q(t_n) = \mathcal{F}_q(t_n; M_n; F_n) = \left\{f_\theta = \sum_{j=1}^{M_n} \theta_j f_j : \theta \in B_q(t_n; M_n), f_j \in F_n\right\}, \quad 0 \le q \le 1, \ t_n > 0.$$

One of our goals is to propose an estimator $\hat f_{F_n} = \sum_{j=1}^{M_n} \hat\theta_j f_j$ such that its risk is upper bounded
by a multiple of the smallest risk over the class $\mathcal{F}_q(t_n)$ plus a small risk regret term

$$R(\hat f_{F_n}; f_0; n) \le C \inf_{f_\theta \in \mathcal{F}_q(t_n)} \|f_\theta - f_0\|^2 + \mathrm{REG}_q(t_n; M_n),$$

where $C$ is a constant that does not depend on $f_0$, $n$, and $M_n$, or $C = 1$ under some conditions.
We aim to obtain the optimal order of convergence for the risk regret term.
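
As a small illustration of the definitions above, the following sketch computes the $\ell_q$-norm of a coefficient vector and checks membership in an $\ell_q$-ball. The function names `lq_norm` and `in_lq_ball` are hypothetical helpers introduced only for this example; the convention that $q = 0$ counts nonzero coefficients and $0 < q \le 1$ uses the usual quasinorm is assumed.

```python
def lq_norm(theta, q):
    """Return ||theta||_q for 0 <= q <= 1; for q == 0 this counts nonzero entries."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    if q == 0:
        return float(sum(1 for x in theta if x != 0))
    return sum(abs(x) ** q for x in theta) ** (1.0 / q)


def in_lq_ball(theta, q, t_n):
    """Check whether theta belongs to the l_q-ball of radius t_n."""
    return lq_norm(theta, q) <= t_n


theta = [0.5, 0.0, -0.25, 0.0]
print(lq_norm(theta, 0))    # number of nonzero coefficients
print(lq_norm(theta, 1))    # sum of absolute values
print(lq_norm(theta, 0.5))  # quasinorm; at least as large as the l_1 value
```

For a fixed vector the value grows as $q$ decreases from 1 toward 0, so for $0 < q \le 1$ a smaller $q$ gives a more restrictive ball at the same radius $t_n$.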



2.2. Two starting estimators 

A key step of our strategy is the construction of candidate estimators using subsets of the initial
estimates. The following two estimators (T- and AC-estimators) were chosen because of the relatively
mild assumptions for them to work with respect to the squared $L_2$ risk. Under the data generating
model (2.1) and i.i.d. observations $(X_1, Y_1), \ldots, (X_n, Y_n)$, suppose we are given $(g_1, \ldots, g_m)$ terms
for the regression problem.

When working on the minimax upper bounds in random design settings, we will always make
the following assumption on the true regression function.

Assumption BD: There exists a known constant $L > 0$ such that $\|f_0\|_\infty \le L < \infty$.

(T-estimator) Birgé [15] constructed the T-estimator and derived its $L_2$ risk bounds under the
Gaussian regression setting. The following proposition is a simple consequence of Theorem 3 in [15].
Suppose

T1. The error distribution $h(\cdot)$ is normal;
T2. $0 < \sigma < \infty$ is known.

Proposition 1. Suppose Assumptions BD and T1, T2 hold. We can construct a T-estimator $\hat f^{(T)}$
such that

$$E\|\hat f^{(T)} - f_0\|^2 \le C_{L,\sigma} \left( \inf_{\vartheta \in \mathbb{R}^m} \left\| \sum_{j=1}^{m} \vartheta_j g_j - f_0 \right\|^2 + \frac{m}{n} \wedge 1 \right),$$

where $C_{L,\sigma}$ is a constant depending only on $L$ and $\sigma$.

(AC-estimator) For our purpose, consider the class of linear combinations with the $\ell_1$-constraint
$\mathcal{G} = \{g = \sum_{j=1}^{m} \vartheta_j g_j : \|\vartheta\|_1 \le s\}$ for some $s > 0$. Audibert and Catoni proposed a sophisticated
AC-estimator $\hat f_s^{(AC)}$ ([4], page 25). The following proposition is a direct result from Theorem 4.1 in
[4] under the following conditions.

AC1. There exists a constant $H > 0$ such that $\sup_{g, g' \in \mathcal{G},\, x \in \mathcal{X}} |g(x) - g'(x)| = H < \infty$.

AC2. There exists a constant $\sigma' > 0$ such that $\sup_{x \in \mathcal{X}} E\left((Y - g^*(X))^2 \mid X = x\right) \le (\sigma')^2 < \infty$,
where $g^* = \arg\inf_{g \in \mathcal{G}} \|g - f_0\|^2$.

Proposition 2. Suppose Assumptions AC1 and AC2 hold. For any $s > 0$, we can construct an
AC-estimator $\hat f_s^{(AC)}$ such that

$$E\|\hat f_s^{(AC)} - f_0\|^2 \le \inf_{g \in \mathcal{G}} \|g - f_0\|^2 + c\,(2\sigma' + H)^2 \, \frac{m}{n},$$

where $c$ is a pure constant.

Note that under the assumption $\|f_0\|_\infty \le L$, we can always enforce the estimators $\hat f^{(T)}$ and
$\hat f_s^{(AC)}$ to be in the range of $[-L, L]$ with the same risk bounds in the propositions.



2.3. Two aggregation algorithms for adaptation 

Suppose $N$ estimates $f_1, \ldots, f_N$ are obtained from $N$ candidate procedures based on some initial
data. Two aggregation algorithms, the ARM algorithm (Adaptive Regression by Mixing, Yang [68])
and Catoni's algorithm (Catoni [24]), can be used to construct the final estimator $\hat f$ by aggregating
the candidate estimates $f_1, \ldots, f_N$ based on $n$ additional i.i.d. observations $(X_i, Y_i)_{i=1}^{n}$. The ARM
algorithm requires knowing the form of the error distribution but it allows heavy tail cases. In 
contrast, Catoni's algorithm does not assume any functional form of the error distribution, but 
demands exponential decay of the tail probability. 

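(The ARM algorithm) Suppose
Y1. There exist two known constants $\underline{\sigma}$ and $\overline{\sigma}$ such that $0 < \underline{\sigma} \le \sigma \le \overline{\sigma} < \infty$;
Y2. The error density function $h(x)$ has a finite fourth moment and for each pair of constants
$R_0 > 0$ and $0 < S_0 < 1$, there exists a constant $B_{S_0,R_0}$ (depending on $S_0$ and $R_0$) such that for all
$|R| < R_0$ and $S_0 \le S \le S_0^{-1}$,

$$\int h(x) \log \frac{h(x)}{S^{-1} h((x - R)/S)} \, dx \le B_{S_0,R_0} \left( (1 - S)^2 + R^2 \right).$$

We can construct an estimator which aggregates $f_1, \ldots, f_N$ by the ARM algorithm as described
below.

Step 1. Split the data into two parts $Z^{(1)} = (X_i, Y_i)_{i=1}^{n_1}$, $Z^{(2)} = (X_i, Y_i)_{i=n_1+1}^{n}$. Take $n_1 = \lfloor n/2 \rfloor$.
Step 2. Estimate $\sigma^2$ for each $f_k$ using the data $Z^{(1)}$:

$$\hat\sigma_k^2 = \frac{1}{n_1} \sum_{i=1}^{n_1} \left( Y_i - f_k(X_i) \right)^2, \quad \text{for } 1 \le k \le N.$$

Clip the estimate $\hat\sigma_k^2$ into the range $[\underline{\sigma}^2, \overline{\sigma}^2]$ if needed.
Step 3. Evaluate predictions for each $k$. For $n_1 + 1 \le l \le n$, predict $Y_l$ by $f_k(X_l)$ and compute

$$E_{k,l} = \frac{\prod_{i=n_1+1}^{l} h\left( (Y_i - f_k(X_i)) / \hat\sigma_k \right)}{\hat\sigma_k^{\,l - n_1}}.$$

Step 4. Compute the final estimate $\hat f^{Y} = \sum_{k=1}^{N} W_k f_k$ with

$$W_k = \frac{1}{n - n_1} \sum_{l=n_1+1}^{n} W_{k,l} \quad \text{and} \quad W_{k,l} = \frac{\pi_k E_{k,l}}{\sum_{j=1}^{N} \pi_j E_{j,l}},$$

where $\pi_k$ are prior probabilities such that $\sum_{k=1}^{N} \pi_k = 1$.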

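Proposition 3. (Yang [69], Proposition 1) Suppose Assumptions BD and Y1, Y2 hold, and
$\|f_k\|_\infty \le L < \infty$ with probability 1, $1 \le k \le N$. The estimator $\hat f^{Y}$ by the ARM algorithm has
the risk

$$R(\hat f^{Y}; f_0; n) \le C_Y \inf_{1 \le k \le N} \left( \|f_k - f_0\|^2 + \frac{\sigma^2}{n} \left( 1 + \log \frac{1}{\pi_k} \right) \right),$$

where $C_Y$ is a constant that depends on $\underline{\sigma}$, $\overline{\sigma}$, $L$, and also $h$ (through the fourth moment of the
random error and $B_{S_0,R_0}$ with $S_0 = \underline{\sigma}/\overline{\sigma}$, $R_0 = L$).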

Remark 1. If $\sigma$ is known or other estimators of $\sigma$ are available, the data splitting is not required,
and the ARM algorithm consists of only Steps 3 and 4.
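
The four ARM steps can be sketched in a few lines of code. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the weights take the form $W_{k,l} \propto \pi_k E_{k,l}$ with $E_{k,l}$ the product of scaled error densities over the evaluation points up to $l$, it uses a standard normal density in place of a general $h$, and the names `arm_weights` and `gaussian_log_density` are hypothetical. Work is done on the log scale to avoid underflow.

```python
import math


def gaussian_log_density(z):
    """Log of the standard normal density; stands in for log h (an assumption)."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)


def arm_weights(preds, y, priors, sigma_lo, sigma_hi, log_h=gaussian_log_density):
    """Sketch of ARM weights W_1..W_N; preds[k][i] holds f_k(X_i)."""
    n, N = len(y), len(preds)
    n1 = n // 2  # Step 1: first n1 points estimate scale, the rest evaluate fit
    # Step 2: residual variance of each candidate on the first part, clipped
    sigmas = []
    for k in range(N):
        s2 = sum((y[i] - preds[k][i]) ** 2 for i in range(n1)) / n1
        sigmas.append(math.sqrt(min(max(s2, sigma_lo ** 2), sigma_hi ** 2)))
    # Steps 3-4: accumulate log E_{k,l} over the second part and average the
    # normalized weights W_{k,l} over l = n1+1, ..., n
    log_e = [0.0] * N
    weights = [0.0] * N
    for l in range(n1, n):
        for k in range(N):
            z = (y[l] - preds[k][l]) / sigmas[k]
            log_e[k] += log_h(z) - math.log(sigmas[k])
        scores = [math.log(priors[k]) + log_e[k] for k in range(N)]
        top = max(scores)  # subtract the maximum for numerical stability
        unnorm = [math.exp(s - top) for s in scores]
        total = sum(unnorm)
        for k in range(N):
            weights[k] += unnorm[k] / total / (n - n1)
    return weights


# Toy data: candidate 0 tracks the signal, candidate 1 is identically zero.
x = [i / 20 for i in range(20)]
y = [xi + 0.1 * (-1) ** i for i, xi in enumerate(x)]
preds = [x, [0.0] * 20]
weights = arm_weights(preds, y, priors=[0.5, 0.5], sigma_lo=0.05, sigma_hi=2.0)
print(weights)
```

The aggregated prediction at a new point is then the weighted sum of the candidate predictions with these weights. When $\sigma$ is known, Step 2 can be skipped by passing the known value for every candidate, in line with Remark 1.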

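(Catoni's algorithm) Suppose for some positive constant $\alpha < \infty$, there exist known constants
$U_\alpha, V_\alpha < \infty$ such that
C1. $E(\exp(\alpha|\varepsilon_i|)) \le U_\alpha$;
C2. $\dfrac{E(\varepsilon_i^2 \exp(\alpha|\varepsilon_i|))}{E(\exp(\alpha|\varepsilon_i|))} \le V_\alpha$.

The estimator built using Catoni's algorithm is $\hat f^{C} = \sum_{k=1}^{N} W_k f_k$ with

$$W_k = \frac{1}{n} \sum_{l=1}^{n} \frac{\pi_k \prod_{i=1}^{l} q_k(Y_i|X_i)}{\sum_{j=1}^{N} \pi_j \prod_{i=1}^{l} q_j(Y_i|X_i)}, \quad q_k(Y_i|X_i) = \sqrt{\frac{\lambda_C}{2\pi}} \exp\left( -\frac{\lambda_C}{2} \left( Y_i - f_k(X_i) \right)^2 \right),$$

where $\lambda_C = \min\left\{ \frac{\alpha}{2L}, \left( U_\alpha (17 L^2 + 3.4 V_\alpha) \right)^{-1} \right\}$, and $\pi_k$ is the prior for $f_k$, $1 \le k \le N$, such that
$\sum_{k=1}^{N} \pi_k = 1$.

Proposition 4. (Catoni [24], Theorem 3.6.1) Suppose Assumptions BD and C1, C2 hold, and
$\|f_k\|_\infty \le L < \infty$, $1 \le k \le N$. The estimator $\hat f^{C}$ that aggregates $f_1, \ldots, f_N$ by Catoni's algorithm
has the risk

$$R(\hat f^{C}; f_0; n) \le \inf_{1 \le k \le N} \left( \|f_k - f_0\|^2 + \frac{2}{n \lambda_C} \log \frac{1}{\pi_k} \right).$$

Remark 2. In the risk bound above, the multiplying constant in front of $\|f_k - f_0\|^2$ is one, which
can be important sometimes. Catoni [24] provided results under weaker assumptions than C1 and
C2. In particular, $\varepsilon_i$ and $X_i$ do not have to be independent.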
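
A minimal sketch of exponential weighting in the spirit of Catoni's algorithm follows. It assumes the weights are running averages of Gibbs-type posterior weights with Gaussian-shaped pseudo-likelihoods $q_k$, and it takes the temperature `lam` (playing the role of $\lambda_C$) as a user-supplied input rather than computing it from the moment constants. The name `catoni_weights` is hypothetical. The normalizing constant of $q_k$ cancels in the ratio, so only the squared residuals enter.

```python
import math


def catoni_weights(preds, y, priors, lam):
    """Sketch of progressive-mixture weights; preds[k][i] holds f_k(X_i)."""
    n, N = len(y), len(preds)
    log_w = [math.log(p) for p in priors]  # log prior plus running log pseudo-likelihood
    weights = [0.0] * N
    for l in range(n):
        for k in range(N):
            log_w[k] -= 0.5 * lam * (y[l] - preds[k][l]) ** 2
        top = max(log_w)  # subtract the maximum for numerical stability
        unnorm = [math.exp(v - top) for v in log_w]
        total = sum(unnorm)
        for k in range(N):
            weights[k] += unnorm[k] / total / n  # average over l = 1, ..., n
    return weights


# Toy data: candidate 0 tracks the signal, candidate 1 is identically zero.
x = [i / 20 for i in range(20)]
y = [xi + 0.1 * (-1) ** i for i, xi in enumerate(x)]
preds = [x, [0.0] * 20]
weights = catoni_weights(preds, y, priors=[0.5, 0.5], lam=5.0)
print(weights)
```

Unlike the ARM sketch, no error density or variance estimate is needed here, which mirrors the contrast drawn in the text: this approach does not assume a functional form for the error distribution.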

2.4. Metric entropy and sparse approximation error of $\ell_{q,t_n}$-hulls

It is well-known that the metric entropy plays a fundamental role in determining minimax rates of
convergence, as shown, e.g., in [14, 72].

For each $1 \le m \le M_n$ and each subset $J_m \subset \{1, 2, \ldots, M_n\}$ of size $m$, define
$\mathcal{F}_{J_m} = \{\sum_{j \in J_m} \theta_j f_j : \theta \in \mathbb{R}^m\}$. Let

$$d^2(f_0; \mathcal{F}) = \inf_{f_\theta \in \mathcal{F}} \|f_\theta - f_0\|^2$$

denote the smallest approximation error to $f_0$ over a function class $\mathcal{F}$.

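Theorem 1. (Metric entropy and sparse approximation bound for $\ell_{q,t_n}$-hulls) Suppose $F_n =
\{f_1, f_2, \ldots, f_{M_n}\}$ with $\|f_j\|_{L_2(\nu)} \le 1$, $1 \le j \le M_n$, where $\nu$ is a $\sigma$-finite measure.

(i) For $0 < q \le 1$, there exists a positive constant $c_q$ depending only on $q$, such that for any
$0 < \epsilon < t_n$, $\mathcal{F}_q(t_n)$ contains an $\epsilon$-net $\{e_j\}_{j=1}^{N_\epsilon}$ in the $L_2(\nu)$ distance with $\|e_j\|_0 \le 5(t_n \epsilon^{-1})^{2q/(2-q)} + 1$
for $j = 1, 2, \ldots, N_\epsilon$, where $N_\epsilon$ satisfies

$$\log N_\epsilon \le \begin{cases} c_q \left( t_n \epsilon^{-1} \right)^{\frac{2q}{2-q}} \log\left( 1 + M_n^{\frac{1}{q} - \frac{1}{2}} t_n^{-1} \epsilon \right) & \text{if } \epsilon > t_n M_n^{\frac{1}{2} - \frac{1}{q}}, \\ c_q M_n \log\left( 1 + M_n^{\frac{1}{2} - \frac{1}{q}} t_n \epsilon^{-1} \right) & \text{if } \epsilon \le t_n M_n^{\frac{1}{2} - \frac{1}{q}}. \end{cases} \tag{2.2}$$

(ii) For any $1 \le m \le M_n$, $0 < q \le 1$, $t_n > 0$, there exists a subset $J_m$ and $f_{\theta^m} \in \mathcal{F}_{J_m}$ with
$\|\theta^m\|_1 \le t_n$ such that the sparse approximation error is upper bounded as follows

$$\|f_{\theta^m} - f_0\|^2 - d^2(f_0; \mathcal{F}_q(t_n)) \le 2^{2/q - 1} t_n^2 m^{1 - 2/q}. \tag{2.3}$$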

The metric entropy estimate (2.2) is the best possible. Indeed, if $f_j$, $1 \le j \le M_n$, are orthonormal
functions, then (2.2) is sharp in order for any $\epsilon$ satisfying that $\epsilon/t_n$ is bounded away from 1 (see
[45]). Also note that if we let $\nu_0$ be the discrete measure $\frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}$, where $x_1, x_2, \ldots, x_n$ are fixed
points in a fixed design, then $\|g\|_{L_2(\nu_0)} = \left( \frac{1}{n} \sum_{i=1}^{n} |g(x_i)|^2 \right)^{1/2}$. Thus, part (i) of Theorem 1 implies
Lemma 3 of [53], with an improvement of a $\log(M_n)$ factor when $\epsilon \approx t_n M_n^{1/2 - 1/q}$, and an improvement
from $(t_n \epsilon^{-1})^{2q/(2-q)} \log(M_n)$ to $M_n \log(1 + M_n^{1/2 - 1/q} t_n \epsilon^{-1})$ when $\epsilon < t_n M_n^{1/2 - 1/q}$. These improvements are
useful to derive the exact minimax rates for some of the possible situations in terms of $M_n$, $q$, and
$t_n$.

With the tools provided in Yang and Barron [72], given fixed $q$ and $t_n$, one can derive minimax
rates of convergence for $\ell_q$-aggregation problems and also for linear regression with $\ell_q$-constraints.
However, the goal for this work is to obtain adaptive estimators that simultaneously work for $\mathcal{F}_q(t_n)$
with any choice of $0 \le q \le 1$ and $t_n$, and more.
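
To make the order $t_n^2 m^{1-2/q}$ of the sparse approximation error concrete, the following sketch checks it numerically in a simplified special case that is assumed here for illustration only: the dictionary is orthonormal and the target is itself a linear combination $f_\theta$, so that keeping the $m$ largest coefficients gives a squared $L_2$ error equal to the sum of the squared discarded coefficients. The helper names and the randomly generated coefficient vector are hypothetical; this is not the argument used for general dictionaries.

```python
import random


def lq_norm(theta, q):
    """Quasinorm ||theta||_q for 0 < q <= 1."""
    return sum(abs(x) ** q for x in theta) ** (1.0 / q)


def best_m_term_error_sq(theta, m):
    """Squared l_2 error after keeping the m largest |theta_j| (orthonormal case)."""
    tail = sorted((abs(x) for x in theta), reverse=True)[m:]
    return sum(x * x for x in tail)


random.seed(0)
q = 0.5
# Hypothetical coefficient vector with rapidly decaying entries.
theta = [random.gauss(0.0, 1.0) / (j + 1) ** 3 for j in range(200)]
t_n = lq_norm(theta, q)  # smallest radius whose l_q-ball contains theta

results = []
for m in (1, 5, 20, 50):
    err = best_m_term_error_sq(theta, m)
    bound = 2 ** (2 / q - 1) * t_n ** 2 * m ** (1 - 2 / q)
    results.append((m, err, bound))
    print(m, err, bound)
```

In this special case the inequality follows because the $k$-th largest coefficient is at most $t_n k^{-1/q}$ in absolute value, so the discarded tail sums to at most $t_n^2 m^{1-2/q}$.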

2.5. An insight from the sparse approximation bound based on classical model
selection theory

Consider general $M_n$, $t_n$ and $0 < q \le 1$. With the approximation error bound in Theorem 1, classical
model selection theory can provide key insight on what to expect regarding the minimax rate of
convergence for estimating a function in an $\ell_{q,t_n}$-hull.

Suppose $J_m$ is the best subset model of size $m$ in terms of having the smallest $L_2$ approximation
error to $f_0$. Then the estimator based on $J_m$ is expected to have the risk (under some squared error
loss) of order

$$t_n^2 m^{1 - 2/q} + \frac{\sigma^2 m}{n}.$$

Minimizing this bound over $m$, we get the best choice (in order) in the range $1 \le m \le M_n \wedge n$:

$$m^* = m^*(q, t_n) = \left\lceil 2 \left( n \tau t_n^2 \right)^{q/2} \right\rceil \wedge M_n \wedge n,$$

where $\tau = \sigma^{-2}$ is the precision parameter. When $q = 0$ with $t_n = k_n$, $m^*$ should be taken to be
$k_n \wedge n$. It is the ideal model size (in order) under the $\ell_q$-constraint because it provides the best
possible trade-off between the approximation error and estimation error when $1 \le m \le M_n \wedge n$.

The ratio $m^*/M_n$ is called a sparsity index in [71] (section III.D) that characterizes, up to a log
factor, how much sparse estimation by model selection improves the estimation accuracy based
on nested models only. The calculation of balancing the approximation error and the estimation
error is well-known to lead to the minimax rate of convergence for general full approximation sets
of functions with pre-determined order of the terms in an approximation system (see section 4 of
[72]). However, when the terms are not pre-ordered, there are many models of the same size $m^*$,
and one must pay a price for dealing with exponentially many or more models (see, e.g., section 5 of
[72]). The classical model selection theory that deals with searching over a large number of models
tells us that the price of searching over $\binom{M_n}{m}$ many models is the addition of the term $\log\binom{M_n}{m}/n$
(e.g., [10, 71, 11, 65, 18, 6]). That is, the risk (under squared error type of loss) of the estimator
based on subset selection with a model descriptive complexity term of order $\log\binom{M_n}{m}$ added to the
AIC-type of criteria is typically upper bounded in order by the smallest value of

$$\text{(squared) approximation error}_m + \frac{\sigma^2 m}{n} + \frac{\sigma^2 \log\binom{M_n}{m}}{n}$$

over all the subset models, which is called the index of the resolvability of the function to be
estimated. Note that $\frac{m}{n} + \frac{\log\binom{M_n}{m}}{n}$ is uniformly of order $m\left(1 + \log\left(\frac{M_n}{m}\right)\right)/n$ over $0 < m \le M_n$.

Evaluating the above bound at $m^*$ in our context yields a quite sensible rate of convergence. Note
also that $\log\binom{M_n}{m^*}/n$ (price of searching) is of a higher order than $m^*/n$ (price of estimation) when
$m^* \le M_n/2$. Define

$$\mathrm{SER}(m) = 1 + \log\left(\frac{M_n}{m}\right) \asymp \frac{m + \log\binom{M_n}{m}}{m}, \quad 1 \le m \le M_n,$$

to be the ratio of the price with searching to that without searching (i.e., only the price of estimation
of the parameters in the model). Here "$\asymp$" means of the same order as $n \to \infty$. Observe that
reducing $m^*$ slightly will reduce the order of the searching price $m^* \, \mathrm{SER}(m^*)/n$ (since $x(1 + \log(M_n/x))$
is an increasing function for $0 < x < M_n$) and increase the order of the squared bias plus variance
(i.e., $t_n^2 m^{1-2/q} + \sigma^2 m/n$). The best choice will typically make the approximation error $t_n^2 m^{1-2/q}$ of
the same order as $m\left(1 + \log\left(\frac{M_n}{m}\right)\right)/n$. Define

$$m_* = m_*(q, t_n) = \begin{cases} m^* & \text{if } m^* = M_n \wedge n, \\ \left\lceil \dfrac{m^*}{\left(1 + \log\frac{M_n}{m^*}\right)^{q/2}} \right\rceil = \left\lceil \dfrac{m^*}{\mathrm{SER}(m^*)^{q/2}} \right\rceil & \text{otherwise.} \end{cases}$$



We call this the effective model size (in order) under the $\ell_q$-constraint because evaluating the index of resolvability expression from our oracle inequality at the best model of this size gives the minimax rate of convergence, as will be seen. When $m^* = n$, the minimax risk is of order 1 (or higher sometimes) and thus does not converge. Note that the down-sizing factor $\mathrm{SER}(m^*)^{q/2}$ from $m^*$ to $m_*$ depends on $q$: it becomes more severe as $q$ increases; when $q = 1$, the down-sizing factor reaches the order $\left(1 + \log\frac{M_n}{m^*}\right)^{1/2}$. Since the risk of the ideal model and that by a good model selection rule differ only by a factor of $\log(M_n/m^*)$, as long as $M_n$ is not too large, the price of searching over many models of the same size is small, which is a fact well known in the model selection literature (see, e.g., [71], section III.D).
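To make the down-sizing concrete, the following minimal sketch computes $\mathrm{SER}(m) = 1 + \log(M_n/m)$ and the effective model size from a given ideal size $m^*$. It assumes that $m^*$ has already been computed and capped at $M_n \wedge n$, and that the down-sized value is rounded up to an integer; the function names and the numbers in the demo are illustrative only.

```python
import math


def ser(m: float, M_n: int) -> float:
    """Searching-to-estimation ratio SER(m) = 1 + log(M_n / m)."""
    return 1.0 + math.log(M_n / m)


def effective_model_size(m_ideal: int, M_n: int, n: int, q: float) -> int:
    """Down-size the ideal model size m^* by the factor SER(m^*)^(q/2).

    Assumes m_ideal is already capped at min(M_n, n); in that boundary
    case no down-sizing is applied. Rounding up is an assumption here.
    """
    if m_ideal == min(M_n, n):
        return m_ideal
    return math.ceil(m_ideal / ser(m_ideal, M_n) ** (q / 2.0))


if __name__ == "__main__":
    # Hypothetical values: 1000 dictionary terms, 200 observations, m^* = 50.
    M_n, n, m_ideal = 1000, 200, 50
    for q in (0.25, 0.5, 1.0):
        print(q, effective_model_size(m_ideal, M_n, n, q))
```

The printed sizes shrink as $q$ grows, which is the "more severe" down-sizing described above.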

For $q = 0$, under the assumption of at most $k_n \le M_n \wedge n$ nonzero terms in the linear representation of the true regression function, the risk bound immediately yields the rate $\left(1 + \log\binom{M_n}{k_n}\right)/n \asymp k_n\left(1 + \log\frac{M_n}{k_n}\right)/n$. Thus, from all above, we expect that $\frac{m_*\,\mathrm{SER}(m_*)}{n} \wedge 1$ is the unifying optimal rate of convergence for regression under the $\ell_q$-constraint for $0 \le q \le 1$.

The aforementioned rates of convergence for estimating functions in $\ell_q$-hulls for $0 \le q \le 1$ will be confirmed, and our estimators will achieve the rates adaptively in some generality. From the insight gained above, to construct a multi-directional (or universal) aggregation method that works for all $0 \le q \le 1$, it suffices to aggregate the estimates from the subset models for adaptation, which will automatically lead to simultaneous optimal performance in $\ell_q$-hulls.



3. $\ell_q$-aggregation of estimates

Consider the setup from section 2.1. We focus on the problem of aggregating the estimates in $F_n$ to pursue the best performance in $\mathcal{F}_q(t_n)$ for $0 \le q \le 1$, $t_n > 0$, which we call $\ell_q$-aggregation of estimates. To be more precise, when needed, it will be called $\ell_q(t_n)$-aggregation, and for the special case of $q = 0$, we call it $\ell_0(k_n)$-aggregation for $1 \le k_n \le M_n$.

3.1. The strategy 

For each $1 \le m \le M_n \wedge n$ and each subset model $J_m \subset \{1, 2, \ldots, M_n\}$ of size $m$, let $\mathcal{F}_{J_m}$ be as defined in section 2.4, and let $\mathcal{F}^L_{J_m,s} = \{f_\theta = \sum_{j \in J_m} \theta_j f_j : \|\theta\|_1 \le s, \|f_\theta\|_\infty \le L\}$ be the class of $\ell_1$-constrained linear combinations in $F_n$ with a sup-norm bound on $f_\theta$. Our strategy is as follows.

Step I. Divide the data into two parts: $Z^{(1)} = (X_i, Y_i)_{i=1}^{n_1}$ and $Z^{(2)} = (X_i, Y_i)_{i=n_1+1}^{n}$.
Step II. Based on data $Z^{(1)}$, obtain a T-estimator for each function class $\mathcal{F}_{J_m}$, or obtain an AC-estimator for each combination of $s \in \mathbb{N}$ and function class $\mathcal{F}^L_{J_m,s}$.
Step III. Based on data $Z^{(2)}$, combine all estimators obtained in Step II and the null model ($f \equiv 0$) using Catoni's or the ARM algorithm. Let $p_0$ be a small positive number in $(0, 1)$. In all, we have to combine $\sum_{m=1}^{M_n \wedge n} \binom{M_n}{m}$ T-estimators with the weight $\pi_{J_m} = (1 - p_0)\left((M_n \wedge n)\binom{M_n}{m}\right)^{-1}$ and the null model with the weight $\pi_0 = p_0$, or combine countably many AC-estimators with the weight $\pi_{J_m,s} = (1 - p_0)\left((1 + s)^2 (M_n \wedge n)\binom{M_n}{m}\right)^{-1}$ and the null model with the weight $\pi_0 = p_0$. (Note that sub-probabilities on the models do not affect the validity of the risk bounds to be given.)
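As a sanity check on the weights in Step III, the sketch below evaluates them exactly with rational arithmetic. It assumes the weights $(1 - p_0)/((M_n \wedge n)\binom{M_n}{m})$ for T-estimators and the same quantity divided by $(1 + s)^2$ for AC-estimators; the function names and the truncation level `s_max` are illustrative and not part of the strategy itself.

```python
import math
from fractions import Fraction


def t_weight(m, M_n, n, p0):
    """Weight of one size-m subset model when combining T-estimators."""
    return (1 - p0) / (min(M_n, n) * math.comb(M_n, m))


def ac_weight(m, s, M_n, n, p0):
    """Weight of one (size-m subset, l1-radius s) pair for AC-estimators."""
    return t_weight(m, M_n, n, p0) / (1 + s) ** 2


def t_total_mass(M_n, n, p0):
    """Null-model weight plus the weights of all subset models."""
    sizes = range(1, min(M_n, n) + 1)
    return p0 + sum(math.comb(M_n, m) * t_weight(m, M_n, n, p0) for m in sizes)


def ac_total_mass(M_n, n, p0, s_max):
    """Same total for AC-estimators, truncating the radius at s_max."""
    sizes = range(1, min(M_n, n) + 1)
    return p0 + sum(
        math.comb(M_n, m) * ac_weight(m, s, M_n, n, p0)
        for m in sizes
        for s in range(1, s_max + 1)
    )


if __name__ == "__main__":
    p0 = Fraction(1, 10)  # hypothetical choice of the null-model weight
    print(t_total_mass(10, 6, p0))           # T-estimator weights
    print(float(ac_total_mass(10, 6, p0, 50)))  # AC-estimator weights
```

Under these assumptions the T-estimator weights form a probability, while the AC-estimator weights sum to strictly less than one because $\sum_{s \ge 1} (1+s)^{-2} < 1$, which is the sub-probability situation mentioned in Step III.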

For simplicity of exposition, from now on and when relevant, we assume $n$ is even and choose $n_1 = n/2$ in our strategy. However, similar results hold for other values of $n$ and $n_1$.

We use the expression "E-G strategy" for ease of presentation, where E = T or AC represents the estimators constructed in Step II, and G = C or Y stands for the aggregation algorithm used in Step III. By our construction, Assumption AC1 is automatically satisfied: for each $J_m$, $H_{J_m,s} = \sup_{f, f' \in \mathcal{F}^L_{J_m,s},\, x \in \mathcal{X}} |f(x) - f'(x)| \le 2L$. Assumption AC2 is met with $(\sigma')^2 = \sigma^2 + 4L^2$.
We assume the following conditions are satisfied for each strategy, respectively.

$A_{T\text{-}C}$ and $A_{T\text{-}Y}$: BD, T1, T2.

$A_{AC\text{-}C}$: BD, C1, C2.
$A_{AC\text{-}Y}$: BD, Y1, Y2.

Given that T1, T2 are stronger than C1, C2 and Y1, Y2, it is enough to require their satisfaction in $A_{T\text{-}C}$ and $A_{T\text{-}Y}$.



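3.2. Minimax rates for $\ell_q$-aggregation of estimates

Let $\mathcal{F}_q^L(t_n) = \mathcal{F}_q(t_n) \cap \{f : \|f\|_\infty \le L\}$ for $0 \le q \le 1$. In the previous section, we have defined $m_* = m_*(q, t_n)$ to be the effective model size for $0 < q \le 1$. Now, for ease of presentation, we extend the definition to

$$m_*^{\mathcal{F}} = \begin{cases} m_*(q, t_n) & \text{for case 1: } \mathcal{F} = \mathcal{F}_q(t_n),\ 0 < q \le 1, \\ k_n \wedge n & \text{for case 2: } \mathcal{F} = \mathcal{F}_0(k_n), \\ m_*(q, t_n) \wedge k_n & \text{for case 3: } \mathcal{F} = \mathcal{F}_q(t_n) \cap \mathcal{F}_0(k_n),\ 0 < q \le 1. \end{cases}$$

Note that in the third case, we are simply taking the smaller one between the effective model sizes from the soft sparsity constraint ($\ell_q$-constraint with $0 < q \le 1$) and the hard sparsity one ($\ell_0$-constraint), and this smaller size defines the final sparsity. Define

$$\mathrm{REG}(m_*^{\mathcal{F}}) = \sigma^2 \left(1 \wedge \frac{m_*^{\mathcal{F}} \left(1 + \log\frac{M_n}{m_*^{\mathcal{F}}}\right)}{n}\right),$$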

which will be shown to be typically the optimal rate of the risk regret for $\ell_q$-aggregation. In particular, Theorems 2 and 3 provide upper and lower bounds to determine the order of the risk regret for $\ell_q$-aggregation of estimates. The specific behaviors of $\mathrm{REG}(m_*^{\mathcal{F}})$ for the three different cases will be precisely discussed later.
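For readers who want to experiment with the regret term, here is a small sketch. It assumes the form $\mathrm{REG}(m) = \sigma^2\left(1 \wedge m(1 + \log(M_n/m))/n\right)$ and takes the $\ell_q$ effective size $m_*(q, t_n)$ as a precomputed input; the helper names are ours.

```python
import math


def effective_size_for_case(case, n, m_q=None, k_n=None):
    """Effective model size for the three cases of the function class.

    m_q stands for the precomputed l_q effective size m_*(q, t_n);
    k_n is the hard-sparsity bound. Both are inputs, not derived here.
    """
    if case == 1:
        return m_q
    if case == 2:
        return min(k_n, n)
    if case == 3:
        return min(m_q, k_n)
    raise ValueError("case must be 1, 2 or 3")


def reg(m_eff, M_n, n, sigma2=1.0):
    """Risk regret sigma^2 * min(1, m (1 + log(M_n / m)) / n)."""
    if not 1 <= m_eff <= M_n:
        raise ValueError("need 1 <= m_eff <= M_n")
    return sigma2 * min(1.0, m_eff * (1.0 + math.log(M_n / m_eff)) / n)


if __name__ == "__main__":
    # Hypothetical setting: M_n = 1000, n = 200, m_*(q, t_n) = 26, k_n = 10.
    for case in (1, 2, 3):
        m = effective_size_for_case(case, n=200, m_q=26, k_n=10)
        print(case, m, reg(m, M_n=1000, n=200))
```

In the hypothetical setting of the demo, case 3 coincides with case 2 because the hard sparsity bound is the smaller of the two sizes.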

For case 3, we intend to achieve the best performance of linear combinations when both $\ell_0$- and $\ell_q$-constraints are imposed on the linear coefficients, which results in $\ell_q$-aggregation using just a subset of the initial estimates and will be called $\ell_0 \cap \ell_q$-aggregation. For the special case of $q = 1$, this $\ell_0 \cap \ell_1$-aggregation is studied in Yang [69] (page 36) for multi-directional aggregation and in Lounici [49] (called D-convex aggregation) more formally, giving also lower bounds. Our results below not only handle $q < 1$ but also close a gap of a logarithmic factor in upper and lower bounds in [49].

For ease of presentation, we may use the same symbol (e.g., C) to denote possibly different 
constants of the same nature. 

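Theorem 2. Suppose $A_{E\text{-}G}$ holds for the E-G strategy respectively. Our estimator $\hat f_{F_n}$ simultaneously has the following properties.

(i) For T-strategies, for $\mathcal{F} = \mathcal{F}_q(t_n)$ with $0 < q \le 1$, or $\mathcal{F} = \mathcal{F}_0(k_n)$, or $\mathcal{F} = \mathcal{F}_q(t_n) \cap \mathcal{F}_0(k_n)$ with $0 < q \le 1$, we have

$$R(\hat f_{F_n}; f_0; n) \le \left[C_0\, d^2(f_0; \mathcal{F}) + C_1\, \mathrm{REG}(m_*^{\mathcal{F}})\right] \wedge \left[C_0 \left(\|f_0\|^2 \vee \frac{C_2 \sigma^2}{n}\right)\right].$$

(ii) For AC-strategies, for $\mathcal{F} = \mathcal{F}_q(t_n)$ with $0 < q \le 1$, or $\mathcal{F} = \mathcal{F}_0(k_n)$, or $\mathcal{F} = \mathcal{F}_q(t_n) \cap \mathcal{F}_0(k_n)$ with $0 < q \le 1$, we have

$$R(\hat f_{F_n}; f_0; n) \le C_1\, \mathrm{REG}(m_*^{\mathcal{F}}) + \begin{cases} C_0\, d^2(f_0; \mathcal{F}_q^L(t_n)) + \dfrac{C_2 \sigma^2 \log(1 + t_n)}{n} & \text{for case 1,} \\ \inf_{s \ge 1} \left( C_0 \inf_{\{\theta : \|\theta\|_1 \le s,\ \|\theta\|_0 \le k_n,\ \|f_\theta\|_\infty \le L\}} \|f_\theta - f_0\|^2 + \dfrac{C_2 \sigma^2 \log(1 + s)}{n} \right) & \text{for case 2,} \\ C_0\, d^2(f_0; \mathcal{F}_q^L(t_n) \cap \mathcal{F}_0^L(k_n)) + \dfrac{C_2 \sigma^2 \log(1 + t_n)}{n} & \text{for case 3.} \end{cases}$$

Also, $R(\hat f_{F_n}; f_0; n) \le C_0 \left(\|f_0\|^2 \vee \frac{C_2 \sigma^2}{n}\right)$.

For all these cases, $C_0$ and $C_2$ do not depend on $n, f_0, t_n, q, k_n, M_n$; $C_1$ does not depend on $n, f_0, t_n, k_n, M_n$. These constants may depend on $L$, $p_0$, $\sigma^2$, $\overline{\sigma}^2/\underline{\sigma}^2$, $\alpha$, $U_\alpha$, $V_\alpha$ when relevant. An exception is that $C_0 = 1$ for the AC-C strategy.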

Remark 3. When $q = 1$, our theorem covers some important previous aggregation results. With $t_n = 1$, Juditsky and Nemirovski [42] obtained the optimal result for large $M_n$; Yang [69] gave upper bounds for all $M_n$, but the rate is slightly sub-optimal (by a logarithmic factor) when $M_n = O(\sqrt{n})$ and with a factor larger than 1 in front of the approximation error; Tsybakov [58] presented the optimal rate for both large and small $M_n$, but under the assumption that the joint distribution of $\{f_j(X), j = 1, \ldots, M_n\}$ is known. For the case $M_n = O(\sqrt{n})$, Audibert and Catoni [4] have improved over [69] and [58] by giving an optimal risk bound. Even when $q = 1$, our result is more general in that $t_n$ is allowed to be arbitrary. Note also that in some specific cases, the induced sparsity with $\ell_1$-constraint was explored earlier in, e.g., [30, 9, 71]. The latter two papers dealt with nonparametric situations with mild assumptions on the terms in the approximation systems. In particular, when the true function has a finite-order linear expression, the estimators achieve the minimax optimal rate $(\log n)/n$ when $M_n$ grows polynomially fast in $n$.

Remark 4. The upper rate for $q = 0$ as well as its interpretation is not new in the literature (see, e.g., Theorem 1 of [71], Theorems 1 and 4 of [65]): by noticing that there are $\binom{M_n}{k_n}$ subsets of size $k_n$ and that $\log\binom{M_n}{k_n} \le k_n\left(1 + \log(M_n/k_n)\right)$, the rate for $q = 0$, which directly imposes hard sparsity on the maximum number of relevant terms, is just the log number of models of size $k_n$ divided by the sample size.
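The counting inequality used in Remark 4 is easy to check numerically. The sketch below is only an illustration (the function names are ours): it compares $\log\binom{M_n}{k_n}$ with $k_n(1 + \log(M_n/k_n))$ over a grid of small values.

```python
import math


def log_num_models(M_n: int, k_n: int) -> float:
    """Log of the number of subset models of size k_n."""
    return math.log(math.comb(M_n, k_n))


def sparsity_bound(M_n: int, k_n: int) -> float:
    """The upper bound k_n * (1 + log(M_n / k_n))."""
    return k_n * (1.0 + math.log(M_n / k_n))


def bound_holds(M_max: int) -> bool:
    """Check the inequality for all 1 <= k_n <= M_n <= M_max."""
    return all(
        log_num_models(M, k) <= sparsity_bound(M, k) + 1e-12
        for M in range(1, M_max + 1)
        for k in range(1, M + 1)
    )


if __name__ == "__main__":
    print(bound_holds(60))
    print(log_num_models(1000, 10), sparsity_bound(1000, 10))
```

The small tolerance only guards against floating-point rounding; the inequality itself follows from the standard bound $\binom{M}{k} \le (eM/k)^k$.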

Remark 5. Note that an extra term of $\log(1 + t_n)/n$ is present in the upper bounds of the estimator obtained by AC-strategies. For case 1, if $t_n \le e^{cn} \wedge e^{c\, m_* (1 + \log(M_n/m_*))}$ for some pure constant $c$, then $\log(1 + t_n)/n$ is upper bounded by a multiple of $\mathrm{REG}(m_*^{\mathcal{F}_q(t_n)})$. Then, under the condition that the approximation errors involved in the risk bounds are of the same order, AC-strategies have the same upper bound orders as T-strategies. For case 2, the same is true if for some $s \le e^{cn} \wedge e^{c\, k_n (1 + \log(M_n/k_n))}$, the $\ell_1$-norm constraint does not enlarge the approximation error order.

Remark 6. For case 2, the boundedness assumption of $\|f_j\| \le 1$, $1 \le j \le M_n$, is not necessary.

Remark 7. If the true function $f_0$ happens to have a small $L_2$ norm such that $\|f_0\|^2 \vee \frac{\sigma^2}{n}$ is of a smaller order than $\mathrm{REG}(m_*^{\mathcal{F}})$, then its inclusion in the risk bounds may improve the rate of convergence.

Next, we show that the upper rates in Theorem 2 cannot be generally improved by giving a theorem stating that the lower bounds of the risk are of the same order in some situations, as is typically done in the literature on aggregation of estimates. The following theorem implies that the estimator by our strategies is indeed minimax adaptive for $\ell_q$-aggregation of estimates. Let $f_1, \ldots, f_{M_n}$ be an orthonormal basis with respect to the distribution of $X$. Since the earlier upper bounds are obtained under the assumption that the true regression function $f_0$ satisfies $\|f_0\|_\infty \le L$ for some known (possibly large) constant $L > 0$, for our lower bound result below, this assumption will also be considered. For the last result in part (iii) below under the sup-norm constraint on $f_0$, the functions $f_1, \ldots, f_{M_n}$ are specially constructed on $[0, 1]$ and $P_X$ is the uniform distribution on $[0, 1]$. See the proof for details.

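In order to give minimax lower bounds without any norm assumption on $f_0$, let $\tilde m_*^{\mathcal{F}}$ be defined the same as $m_*^{\mathcal{F}}$ except that the ceiling of $n$ is removed. Define

$$\underline{\mathrm{REG}}(\tilde m_*^{\mathcal{F}}) = \frac{\sigma^2\, \tilde m_*^{\mathcal{F}} \left(1 + \log\frac{M_n}{\tilde m_*^{\mathcal{F}}}\right)}{n} \wedge \begin{cases} t_n^2 & \text{for cases 1 and 3,} \\ \infty & \text{for case 2,} \end{cases}$$

$$\overline{\mathrm{REG}}(m_*^{\mathcal{F}}) = \mathrm{REG}(m_*^{\mathcal{F}}) \wedge \begin{cases} t_n^2 & \text{for cases 1 and 3,} \\ \infty & \text{for case 2.} \end{cases}$$

Theorem 3. Suppose the noise $\varepsilon$ follows a normal distribution with mean 0 and variance $\sigma^2 > 0$.

(i) For any aggregated estimator $\hat f_{F_n}$ based on an orthonormal dictionary $F_n = \{f_1, \ldots, f_{M_n}\}$, for $\mathcal{F} = \mathcal{F}_q(t_n)$, or $\mathcal{F} = \mathcal{F}_0(k_n)$, or $\mathcal{F} = \mathcal{F}_q(t_n) \cap \mathcal{F}_0(k_n)$ with $0 < q \le 1$, one can find a regression function $f_0$ (that may depend on $\mathcal{F}$) such that

$$R(\hat f_{F_n}; f_0; n) - d^2(f_0; \mathcal{F}) \ge C \cdot \underline{\mathrm{REG}}(\tilde m_*^{\mathcal{F}}),$$

where $C$ may depend on $q$ (and only $q$) for cases 1 and 3 and is an absolute constant for case 2.

(ii) Under the additional assumption that $\|f_0\| \le L$ for a known $L > 0$, the above lower bound becomes $C \cdot \overline{\mathrm{REG}}(m_*^{\mathcal{F}})$ for the three cases, where $C$ may depend on $q$ and $L$ for cases 1 and 3 and on $L$ for case 2.

(iii) With the additional knowledge $\|f_0\|_\infty \le L$ for a known $L > 0$, the lower bound $C \cdot \overline{\mathrm{REG}}(m_*^{\mathcal{F}})$ also holds for the following situations: 1) for $\mathcal{F} = \mathcal{F}_q(t_n)$ with $0 < q \le 1$, if $\sup_{f_\theta \in \mathcal{F}_q(t_n)} \|f_\theta\|_\infty \le L$; 2) for $\mathcal{F} = \mathcal{F}_0(k_n)$, if $\sup_{1 \le j \le M_n} \|f_j\|_\infty \le L < \infty$ and $\frac{k_n^2}{n}\left(1 + \log\frac{M_n}{k_n}\right)$ are bounded above; 3) for $\mathcal{F} = \mathcal{F}_0(k_n)$, if $M_n / \left(1 + \log\frac{M_n}{k_n}\right) \le bn$ for some constant $b > 0$ and the orthonormal basis is specially chosen.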

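For satisfaction of $\sup_{f_\theta \in \mathcal{F}_q(t_n)} \|f_\theta\|_\infty \le L$, consider uniformly bounded functions $f_j$; then for $0 < q \le 1$,

$$\Big\|\sum_{j=1}^{M_n} \theta_j f_j\Big\|_\infty \le \sum_{j=1}^{M_n} |\theta_j|\, \|f_j\|_\infty \le \Big(\sup_{1 \le j \le M_n} \|f_j\|_\infty\Big) \|\theta\|_1 \le \Big(\sup_{1 \le j \le M_n} \|f_j\|_\infty\Big) \|\theta\|_q.$$

Thus, under the condition that $\left(\sup_{1 \le j \le M_n} \|f_j\|_\infty\right) t_n$ is upper bounded, $\sup_{f_\theta \in \mathcal{F}_q(t_n)} \|f_\theta\|_\infty \le L$ is met.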
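The last step of the display uses $\|\theta\|_1 \le \|\theta\|_q$ for $0 < q \le 1$. A short numerical illustration is given below; the coefficient vector and the dictionary sup-norm bound `dict_sup` are hypothetical inputs.

```python
def lq_norm(theta, q):
    """(sum_j |theta_j|^q)^(1/q) for 0 < q <= 1 (a quasi-norm when q < 1)."""
    return sum(abs(t) ** q for t in theta) ** (1.0 / q)


def sup_norm_bound(theta, q, dict_sup):
    """Upper bound on the sup-norm of f_theta when every ||f_j||_inf <= dict_sup."""
    return dict_sup * lq_norm(theta, q)


if __name__ == "__main__":
    theta = [0.5, -0.25, 0.1, 0.0]  # hypothetical coefficients
    l1 = lq_norm(theta, 1.0)
    for q in (0.25, 0.5, 0.75, 1.0):
        print(q, lq_norm(theta, q), lq_norm(theta, q) >= l1 - 1e-12)
```

So a bound on $\|\theta\|_q$ with $q \le 1$ controls the $\ell_1$-norm, and hence the sup-norm of the combination, whenever the dictionary is uniformly bounded.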

The lower bounds given in part (iii) of the theorem for the three cases of $\ell_q$-aggregation of estimates are of the same order as the upper bounds in the previous theorem, respectively, unless $t_n$ is too small. Hence, under the given conditions, the minimax rates for $\ell_q$-aggregation are identified. When no restriction is imposed on the norm of $f_0$, the lower bounds can certainly approach infinity (e.g., when $t_n$ is really large). That is why $\underline{\mathrm{REG}}(\tilde m_*^{\mathcal{F}})$ is introduced. The same can be said for later lower bounds.

For the new case $0 < q < 1$, the $\ell_q$-constraint imposes a type of soft-sparsity more stringent than $q = 1$: even more coefficients in the linear expression are pretty much negligible. For the discussion below, assume $m^* < n$. When the radius $t_n$ increases or $q \to 1$, $m^*$ increases given that the $\ell_q$-ball enlarges. When $m_* = m^* = M_n < n$, the $\ell_q$-constraint is not tight enough to impose sparsity: $\ell_q$-aggregation is then simply equivalent to linear aggregation and the risk regret term corresponds to the estimation price of the full model, $M_n \sigma^2 / n$. In contrast, when $1 < m_* \le m^* < M_n \wedge n$, the rate for $\ell_q$-aggregation can be expressed in different ways:

$$\sigma^{2-q} t_n^q \left(\frac{1 + \log\frac{M_n}{m^*}}{n}\right)^{1-q/2} \asymp \frac{\sigma^2 m_*}{n}\,\mathrm{SER}(m_*) \asymp \frac{\sigma^2 m_*}{n}\,\mathrm{SER}(m^*) \asymp \frac{\sigma^2 m^*}{n}\,\mathrm{SER}(m^*)^{1-q/2}.$$

The second expression is transparent in interpretation: due to the sparsity condition, we only need to consider models of the effective size $m_*$ and the risk goes with the searching price $\frac{\sigma^2 m_*}{n}\mathrm{SER}(m_*)$ (the estimation error of parameters is being dominated in order). The last expression means that we can do better than searching over the models of the ideal model size $m^*$, which has the risk $\frac{\sigma^2 m^*}{n}\mathrm{SER}(m^*)$. The minimax risk is deflated by a factor of $\mathrm{SER}(m^*)^{q/2}$, which becomes larger as $q \to 1$, pointing out that the factor $\mathrm{SER}(m^*)$ has to be downsized more as the $\ell_q$-ball becomes larger. When $m^* = M_n$ (the full model), $\mathrm{SER}(m^*)$ reduces to 1. When $m^* < \left(1 + \log(M_n/m^*)\right)^{q/2}$ or equivalently $m_* = 1$, the $\ell_q$-constraint restricts the search space of the optimization problem so much that it suffices to consider at most one $f_j$ and the null model may provide a better risk.

Now let us explain that our $\ell_q$-aggregation includes the commonly studied aggregation problems in the literature. First, when $q = 1$, we have the well-known convex or $\ell_1$-aggregation (but now with the $\ell_1$-norm bound allowed to be general). Second, when $q = 0$, with $k_n = M_n \le n$, we have the linear aggregation. For other $k_n < M_n \wedge n$, we have the aggregation to achieve the best linear performance of only $k_n$ initial estimates. The case $q = 0$ and $k_n = 1$ has a special implication. Observe that from Theorem 2, we deduce that for both the T-strategies and AC-strategies, under the assumption $\sup_j \|f_j\|_\infty \le L$, our estimator satisfies

$$R(\hat f_{F_n}; f_0; n) \le C_0 \inf_{1 \le j \le M_n} \|f_j - f_0\|^2 + C_1 \sigma^2 \left(1 \wedge \frac{1 + \log M_n}{n}\right),$$

where $C_0 = 1$ for the AC-C strategy. Together with the lower bound of the order $\sigma^2\left(1 \wedge \frac{\log M_n}{n}\right)$ on the risk regret of aggregation for adaptation given in [58], we conclude that $\ell_0(1)$-aggregation directly implies the aggregation for adaptation (model selection aggregation). As mentioned earlier, $\ell_0(k_n) \cap \ell_q(t_n)$-aggregation pursues the best performance of the linear combination of at most $k_n$ initial estimates with coefficients satisfying the $\ell_q$-constraint, which includes the D-convex aggregation as a special case (with $q = 1$).

3.3. $\ell_q$-combination of procedures

Suppose we start with a collection of estimation procedures $\Delta = \{\delta_1, \ldots, \delta_{M_n}\}$ instead of a dictionary of estimates. Let $\hat f_j$ be the estimator of the unknown true regression function based on the procedure $\delta_j$, $1 \le j \le M_n$, at a certain sample size. Our goal is to combine the estimators $\{\hat f_j : 1 \le j \le M_n\}$ to achieve the best performance in

$$\mathcal{F}_q(t_n; \Delta) = \Big\{ f_\theta = \sum_{j=1}^{M_n} \theta_j \hat f_j : \|\theta\|_q \le t_n \Big\}, \quad 0 \le q \le 1,\ t_n > 0.$$

We split the data $(X_1, Y_1), \ldots, (X_n, Y_n)$ into three parts: $Z^{(1)} = (X_i, Y_i)_{i=1}^{n_1}$, $Z^{(2)} = (X_i, Y_i)_{i=n_1+1}^{n_1+n_2}$, and $Z^{(3)} = (X_i, Y_i)_{i=n_1+n_2+1}^{n}$. We use the data $Z^{(1)}$ to obtain estimators $\hat f_1, \ldots, \hat f_{M_n}$ and use the data $Z^{(2)}$ to construct T-estimators or AC-estimators based on subsets of $\hat f_1, \ldots, \hat f_{M_n}$. The data $Z^{(3)}$ are used to construct the final estimator $\hat f_\Delta$ by aggregating the T-estimators or AC-estimators and the null model using Catoni's or the ARM algorithm as done in the previous section. For simplicity, assume $n$ is a multiple of 4 and choose $n_1 = n/2$, $n_2 = n/4$. Upper bounds for combining procedures by our strategy are obtained similarly. The only difference is that $d^2(f_0; \mathcal{F})$ is replaced by the risk of the best constrained linear combination of the estimators $\hat f_{1,n/2}, \ldots, \hat f_{M_n,n/2}$, where we add the second subscript $n/2$ to emphasize that the estimators are constructed with a reduced sample size. For example, by T-strategies, we have that for any $0 < q \le 1$ and $t_n > 0$,

$$R(\hat f_\Delta; f_0; n) \le C_0 \inf_{\theta : \|\theta\|_q \le t_n} E\Big\| \sum_{j=1}^{M_n} \theta_j \hat f_{j,n/2} - f_0 \Big\|^2 + C_1 \cdot \mathrm{REG}(m_*^{\mathcal{F}_q(t_n)}),$$

and again such risk bounds simultaneously hold for $0 \le q \le 1$ and $t_n > 0$.
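The three-way data split can be written down in a few lines. The sketch below assumes $n$ is a multiple of 4 with $n_1 = n/2$ and $n_2 = n/4$, and returns index ranges rather than data; the function name is ours, and the indices are 0-based, unlike the 1-based indexing of the text.

```python
def three_way_split(n: int):
    """Index ranges for Z^(1), Z^(2), Z^(3) with n1 = n/2 and n2 = n/4.

    Z^(1) is used to fit the candidate procedures, Z^(2) to build the
    T- or AC-estimators on subsets, and Z^(3) for the final aggregation.
    """
    if n <= 0 or n % 4 != 0:
        raise ValueError("n must be a positive multiple of 4")
    n1, n2 = n // 2, n // 4
    return range(0, n1), range(n1, n1 + n2), range(n1 + n2, n)


if __name__ == "__main__":
    part1, part2, part3 = three_way_split(20)
    print(list(part1), list(part2), list(part3))
```

The candidate procedures therefore see only half of the observations, which is why their accuracies enter the risk bound at the reduced sample size $n/2$.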

Note that these risk bounds involve the accuracies of the candidate procedures at a reduced sample size $n/2$ due to data splitting to come up with the estimates to be aggregated. Ideally, we want to have $C_0 = 1$ and $\hat f_{j,n/2}$ replaced by $\hat f_{j,n}$. At this time, we are unaware of any such risk bound that holds for combining general estimators (in the fixed design case, Leung and Barron's algorithm does not involve data splitting, but it works only for least squares estimators). Because of this, the theoretical attractiveness of the constant $C_0$ being 1 in the aggregation stage, unfortunately, disappears, since the remaining parts in the risk bounds also depend on the data splitting and there seems to be no reason to expect with certainty that an aggregation method with $C_0 = 1$ has a better risk, even asymptotically, than another one with $C_0 > 1$. Therefore, for combining general statistical procedures, it is unclear how useful $C_0 = 1$ is even from a theoretical perspective. (It seems that there is one scenario in which one can argue otherwise: the candidate estimates are truly provided. In the application of combining forecasts sequentially, the candidate forecasts may be provided by other experts or commercial companies and the statistician does not have access to the data on which the forecasts are built. In this context, since no data splitting is needed, $C_0 = 1$ leads to a theoretical advantage compared to $C_0 > 1$.) For this reason, in our view, results with $C_0 > 1$ (but not too large) are also important for combining procedures. Indeed, such results often have strengths in other aspects such as allowing heavy-tailed distributions for the errors and allowing dependence of the observations.

Nonetheless, regardless of the degree of practical relevance, limiting attention to the aggregation 
step and pursuing Co = 1 in that local goal is certainly not without a theoretical appeal. 

Some additional interesting results on combining procedures are in [3, 15, 20, 26, 27, 35, 36, 39, 
38, 63, 68]. 



4. Linear regression with $\ell_q$-constrained coefficients under random design

Let us consider the linear regression model with $M_n$ predictors $X_1, \ldots, X_{M_n}$. Suppose the data are drawn i.i.d. from the following model

$$Y = f_0(X) + \varepsilon = \sum_{j=1}^{M_n} \theta_j X_j + \varepsilon. \tag{4.1}$$

As previously defined, for a function $f(x_1, \ldots, x_{M_n}) : \mathcal{X} \to \mathbb{R}$, the $L_2$-norm $\|f\|$ is the square root of $E f^2(X_1, \ldots, X_{M_n})$, where the expectation is taken with respect to $P_X$, the distribution of $X$. Denote the $\ell_q$-hull in this context by

$$\mathcal{F}_q(t_n; M_n) = \Big\{ f_\theta = \sum_{j=1}^{M_n} \theta_j x_j : \|\theta\|_q \le t_n \Big\}, \quad 0 \le q \le 1,\ t_n > 0.$$

For linear regression, we assume coefficients of the true regression function $f_0$ have a sparse $\ell_q$-representation ($0 < q \le 1$) or $\ell_0$-representation or both, i.e., $f_0 \in \mathcal{F}$ where $\mathcal{F} = \mathcal{F}_q(t_n; M_n)$, $\mathcal{F}_0(k_n; M_n)$ or $\mathcal{F}_q(t_n; M_n) \cap \mathcal{F}_0(k_n; M_n)$.
Assumptions BD and $A_{E\text{-}G}$ are still relevant in this section. As in the previous section, for AC-estimators, we consider $\ell_1$- and sup-norm constraints.

For each $1 \le m \le M_n \wedge n$ and each subset $J_m$ of size $m$, let $\mathcal{G}_{J_m} = \{\sum_{j \in J_m} \theta_j x_j : \theta \in \mathbb{R}^m\}$ and $\mathcal{G}^L_{J_m,s} = \{\sum_{j \in J_m} \theta_j x_j : \|\theta\|_1 \le s, \|f_\theta\|_\infty \le L\}$. We introduce now the adaptive estimator $\hat f_A$, built with the same strategy used to construct $\hat f_{F_n}$ except that we now consider $\mathcal{G}_{J_m}$ and $\mathcal{G}^L_{J_m,s}$ instead of $\mathcal{F}_{J_m}$ and $\mathcal{F}^L_{J_m,s}$.



4.1. Upper bounds

We give upper bounds for the risk of our estimator assuming $f_0 \in \mathcal{F}_q^L(t_n; M_n)$, $\mathcal{F}_0^L(k_n; M_n)$, or $\mathcal{F}_q^L(t_n; M_n) \cap \mathcal{F}_0^L(k_n; M_n)$, where $\mathcal{F}^L = \{f : f \in \mathcal{F}, \|f\|_\infty \le L\}$ for a positive constant $L$. Let $\alpha_n = \sup_{f \in \mathcal{F}_0^L(k_n; M_n)} \inf\{\|\theta\|_1 : f_\theta = f\}$ be the maximum smallest $\ell_1$-norm needed to represent the functions in $\mathcal{F}_0^L(k_n; M_n)$. For ease of presentation, define $\Psi^{\mathcal{F}}$ as follows:



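$$\Psi^{\mathcal{F}_q^L(t_n; M_n)} = \begin{cases} \sigma^2 & \text{if } m^* = n, \\ \dfrac{\sigma^2 M_n}{n} & \text{if } m^* = M_n < n, \\ \sigma^{2-q} t_n^q \left(\dfrac{1 + \log\frac{M_n}{(n t_n^2 \tau)^{q/2}}}{n}\right)^{1-q/2} \wedge \sigma^2 & \text{if } 1 < m_* \le m^* < M_n \wedge n, \\ \left(t_n^2 \vee \dfrac{\sigma^2}{n}\right) \wedge \sigma^2 & \text{if } m_* = 1, \end{cases}$$

$$\Psi^{\mathcal{F}_0^L(k_n; M_n)} = \sigma^2 \left(1 \wedge \frac{k_n \left(1 + \log\frac{M_n}{k_n}\right)}{n}\right).$$

In addition, for lower bound results, let $\underline{\Psi}^{\mathcal{F}_q(t_n; M_n)}$ ($0 < q \le 1$) and $\underline{\Psi}^{\mathcal{F}_q(t_n; M_n) \cap \mathcal{F}_0(k_n; M_n)}$ ($0 < q \le 1$) be the same as $\Psi^{\mathcal{F}_q^L(t_n; M_n)}$ and $\Psi^{\mathcal{F}_q^L(t_n; M_n) \cap \mathcal{F}_0^L(k_n; M_n)}$, respectively, except that when $0 < q \le 1$ and $m_* = 1$, $\underline{\Psi}^{\mathcal{F}_q(t_n; M_n)}$ takes the value $\sigma^2 \wedge t_n^2$ instead of $\sigma^2 \wedge \left(t_n^2 \vee \frac{\sigma^2}{n}\right)$ and $\underline{\Psi}^{\mathcal{F}_q(t_n; M_n) \cap \mathcal{F}_0(k_n; M_n)}$ is modified the same way.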



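Theorem 4. Suppose $A_{E\text{-}G}$ holds for the E-G strategy respectively, and $\sup_{1 \le j \le M_n} \|X_j\|_\infty \le 1$. The estimator $\hat f_A$ simultaneously has the following properties.

(i) For T-strategies, for $\mathcal{F} = \mathcal{F}_q^L(t_n; M_n)$ with $0 < q \le 1$, or $\mathcal{F} = \mathcal{F}_0^L(k_n; M_n)$, or $\mathcal{F} = \mathcal{F}_q^L(t_n; M_n) \cap \mathcal{F}_0^L(k_n; M_n)$ with $0 < q \le 1$, we have

$$\sup_{f_0 \in \mathcal{F}} R(\hat f_A; f_0; n) \le C_1 \Psi^{\mathcal{F}},$$

where the constant $C_1$ does not depend on $n$.

(ii) For AC-strategies, for $\mathcal{F} = \mathcal{F}_q^L(t_n; M_n)$ with $0 < q \le 1$, or $\mathcal{F} = \mathcal{F}_0^L(k_n; M_n)$, or $\mathcal{F} = \mathcal{F}_q^L(t_n; M_n) \cap \mathcal{F}_0^L(k_n; M_n)$ with $0 < q \le 1$, we have

$$\sup_{f_0 \in \mathcal{F}} R(\hat f_A; f_0; n) \le C_1 \Psi^{\mathcal{F}} + C_2 \begin{cases} \dfrac{\sigma^2 \log(1 + \alpha_n)}{n} & \text{for } \mathcal{F} = \mathcal{F}_0^L(k_n; M_n), \\ \dfrac{\sigma^2 \log(1 + t_n)}{n} & \text{otherwise,} \end{cases}$$

where the constants $C_1$ and $C_2$ do not depend on $n$.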

Remark 8. The constants $C_1$ and $C_2$ may depend on $L$, $p_0$, $\sigma^2$, $\overline{\sigma}^2/\underline{\sigma}^2$, $\alpha$, $U_\alpha$, $V_\alpha$ when relevant.

Remark 9. The rate $\left(\frac{\log M_n}{n}\right)^{1-q/2}$ for $0 < q \le 1$ has appeared in related regression or normal mean problems, e.g., in [30] (Theorem 3), [72] (section 5), [40] (section 6), and [41]. For function classes defined in terms of infinite order orthonormal expansion with bounded $\ell_q$-norm of the coefficients and with $\ell_2$-norm of the tail coefficients decaying at a polynomial order, the rate of convergence $(\log n / n)^{1-q/2}$ is derived in [71] (page 1588) (when the tail of the coefficients decays fast, the rate is improved to $(1/n)^{1-q/2}$). Note that only the upper rates are given there.



4.2. Lower bounds

To derive lower bounds, we make the following near orthogonality assumption on sparse sub- 
collections of the predictors. Such an assumption, similar to the sparse Riesz condition (SRC) 
(Zhang [78]) under fixed design, is used only for lower bounds but not for upper bounds. 

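Assumption SRC: For some $\gamma > 0$, there exist two positive constants $\underline{a}$ and $\overline{a}$ that do not depend on $n$ such that for every $\theta$ with $\|\theta\|_0 \le \min(2\gamma, M_n)$ we have

$$\underline{a}\, \|\theta\|_2 \le \|f_\theta\| \le \overline{a}\, \|\theta\|_2.$$

Theorem 5. Suppose the noise $\varepsilon$ follows a normal distribution with mean 0 and variance $0 < \sigma^2 < \infty$.

(i) For $0 < q \le 1$, under Assumption SRC with $\gamma = m^*$, we have

$$\inf_{\hat f} \sup_{f_0 \in \mathcal{F}_q(t_n; M_n)} E\|\hat f - f_0\|^2 \ge c\, \underline{\Psi}^{\mathcal{F}_q(t_n; M_n)}.$$

(ii) Under Assumption SRC with $\gamma = k_n$, we have

$$\inf_{\hat f} \sup_{f_0 \in \mathcal{F}_0(k_n; M_n) \cap \{f_\theta : \|\theta\|_2 \le a_n\}} E\|\hat f - f_0\|^2 \ge c' \begin{cases} \dfrac{\sigma^2 k_n \left(1 + \log\frac{M_n}{k_n}\right)}{n} & \text{if } a_n \ge \tilde c\, \sigma \sqrt{\dfrac{k_n \left(1 + \log\frac{M_n}{k_n}\right)}{n}}, \\ a_n^2 & \text{if } a_n < \tilde c\, \sigma \sqrt{\dfrac{k_n \left(1 + \log\frac{M_n}{k_n}\right)}{n}}, \end{cases}$$

where $\tilde c$ is a pure constant.

(iii) For any $0 < q \le 1$, under Assumption SRC with $\gamma = k_n \wedge m^*$, we have

$$\inf_{\hat f} \sup_{f_0 \in \mathcal{F}_0(k_n; M_n) \cap \mathcal{F}_q(t_n; M_n)} E\|\hat f - f_0\|^2 \ge c''\, \underline{\Psi}^{\mathcal{F}_q(t_n; M_n) \cap \mathcal{F}_0(k_n; M_n)}.$$

For all cases, $\hat f$ is over all estimators and the constants $c$, $c'$ and $c''$ may depend on $\underline{a}$, $\overline{a}$, $q$ and $\sigma^2$.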



Remark 10. Note that in (i), at the transition from $m_* > 1$ to $m_* = 1$, i.e., $n t_n^2 \tau \approx 1 + \log\frac{M_n}{(n t_n^2 \tau)^{q/2}}$, we see continuity:

$$\sigma^{2-q} t_n^q \left(\frac{1 + \log\frac{M_n}{(n t_n^2 \tau)^{q/2}}}{n}\right)^{1-q/2} \approx t_n^2.$$

For the second case (ii), the lower bound is stated in a more informative way because the effect of the bound on $\|\theta\|_2$ is clearly seen. Normality of the errors is not essential at all for the lower bounds. With some additional effort, one can show that these lower rates are also valid under Assumption Y2, which we will not give here.

4.3. The minimax rates of convergence

Combining the upper and lower bounds, we give a representative minimax rate result with the roles of the key quantities $n$, $M_n$, $q$, and $k_n$ explicitly seen in the rate expressions. Below "$\asymp$" means of the same order when $L$, $L_0$, $q$, $t_n = t$, and $\overline{\sigma}$ ($\overline{\sigma}$ is defined in Theorem 6 below) are held constant in the relevant expressions.

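Theorem 6. Suppose the noise $\varepsilon$ follows a normal distribution with mean 0 and variance $\sigma^2$, and there exists a known constant $\overline{\sigma}$ such that $0 < \sigma \le \overline{\sigma} < \infty$. Also assume there exists a known constant $L_0 > 0$ such that $\sup_{1 \le j \le M_n} \|X_j\|_\infty \le L_0 < \infty$.

(i) For $0 < q \le 1$, under Assumption SRC with $\gamma = m^*$,

$$\inf_{\hat f} \sup_{f_0 \in \mathcal{F}_q^L(t_n; M_n)} E\|\hat f - f_0\|^2 \asymp 1 \wedge \begin{cases} 1 & \text{if } m^* = n, \\ \dfrac{M_n}{n} & \text{if } m^* = M_n < n, \\ \left(\dfrac{1 + \log\frac{M_n}{(n t_n^2)^{q/2}}}{n}\right)^{1-q/2} & \text{if } 1 \le m_* \le m^* < M_n \wedge n. \end{cases}$$

(ii) If there exists a constant $K_0 > 0$ such that $\frac{k_n^2 \left(1 + \log\frac{M_n}{k_n}\right)}{n} \le K_0$, then under Assumption SRC with $\gamma = k_n$,

$$\inf_{\hat f} \sup_{f_0 \in \mathcal{F}_0^L(k_n; M_n) \cap \{f_\theta : \|\theta\|_\infty \le L_0\}} E\|\hat f - f_0\|^2 \asymp 1 \wedge \frac{k_n \left(1 + \log\frac{M_n}{k_n}\right)}{n}.$$

(iii) If $\sigma > 0$ is actually known, then under the condition $\frac{k_n^2 \left(1 + \log\frac{M_n}{k_n}\right)}{n} \le K_0$ and Assumption SRC with $\gamma = k_n$, we have

$$\inf_{\hat f} \sup_{f_0 \in \mathcal{F}_0^L(k_n; M_n)} E\|\hat f - f_0\|^2 \asymp 1 \wedge \frac{k_n \left(1 + \log\frac{M_n}{k_n}\right)}{n},$$

and for any $0 < q \le 1$, under Assumption SRC with $\gamma = k_n \wedge m^*$, we have

$$\inf_{\hat f} \sup_{f_0 \in \mathcal{F}_0^L(k_n; M_n) \cap \mathcal{F}_q^L(t_n; M_n)} E\|\hat f - f_0\|^2 \asymp 1 \wedge \begin{cases} \dfrac{k_n \left(1 + \log\frac{M_n}{k_n}\right)}{n} & \text{if } m_* > k_n, \\ \left(\dfrac{1 + \log\frac{M_n}{(n t_n^2)^{q/2}}}{n}\right)^{1-q/2} & \text{if } 1 \le m_* \le k_n. \end{cases}$$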



Remark 11. When considering jointly the $\ell_q$-constraint for a fixed $0 < q \le 1$ and $q = 0$, since the associated function classes are not nested, one cannot immediately deduce the optimal rate of convergence for their intersection. In our problem, a simple rule works: when the upper bound $k_n$ of the $\ell_0$-constraint is smaller than the effective model size $m_*$, the additional $\ell_q$-constraint does reduce the parameter searching space, but this reduction is not essential and the rate is equal to the rate for $q = 0$. In contrast, when the effective model size $m_*$ is smaller than $k_n$, the $\ell_0$-constraint does reduce the parameter searching space determined by the $\ell_q$-constraint, but the reduction is not essential from the uniform estimation standpoint and the rate is then $m_* \log(1 + M_n/m_*)/n$. Clearly, both rates can be interpreted as the log number of models of size $k_n$ or $m_*$ over the sample size.

5. Adaptive minimax estimation under fixed design

Consider the linear regression model (4.1) under fixed design, $Y_i = f_0(x_i) + \varepsilon_i$, $i = 1, \ldots, n$, where $x_i = (x_{i,1}, \ldots, x_{i,M_n})' \in \mathcal{X} \subset \mathbb{R}^{M_n}$ are fixed, $1 \le i \le n$, and the random errors $\varepsilon_i$ are i.i.d. $N(0, \sigma^2)$. Suppose $\max_{1 \le j \le M_n} \sum_{i=1}^n x_{i,j}^2 / n \le 1$. Let $f_0^n = (f_0(x_1), \ldots, f_0(x_n))'$. For any function $f : \mathcal{X} \to \mathbb{R}$, define the norm $\|\cdot\|_n$ by $\|f\|_n^2 = \frac{1}{n} \sum_{i=1}^n f^2(x_i)$. Our goal is to estimate the regression mean $f_0^n$ through a linear combination of the predictors with the coefficients $\theta$ satisfying an $\ell_q$-constraint ($0 \le q \le 1$). For an estimate $\hat f$ of $f_0$, define its average squared error to be

$$\mathrm{ASE}(\hat f) = \|\hat f - f_0\|_n^2.$$

We consider subset selection based estimators. Let $J_m \subset \{1, 2, \ldots, M_n\}$ be a model of size $m$ ($1 \le m \le M_n$). Our strategy is to choose a model using a model selection criterion, and the resulting least squares estimator is used for $f_0^n$. The loss of a given model $J_m$ is $\mathrm{ASE}(\hat f_{J_m}) = \frac{1}{n} \|\hat Y_{J_m} - f_0^n\|_2^2$ (with a slight abuse of notation), where $\hat Y_{J_m} = (\hat Y_{1,J_m}, \ldots, \hat Y_{n,J_m})'$ is the projection of the response vector onto the column span of the design matrix of model $J_m$. The alternative strategy of model mixing will be taken as well. Although our estimators do not directly consider the $\ell_q$-constraint, they will be shown to automatically adapt to the sparsity of $f_0$ in terms of $\ell_q$-representation by the dictionary.

For a function class $\mathcal{F}$ for the fixed design, define the approximation error $d_n^2(f_0; \mathcal{F}) = \inf_{f \in \mathcal{F}} \|f - f_0\|_n^2$. We will consider both the known-$\sigma$ and the unknown-$\sigma$ cases. As will be seen, the results are quite different in some aspects, and an understanding of what the different assumptions can lead to is important to reach a deeper insight on the theoretical issues.

5.1. When $\sigma$ is known

For a model $J_m$ of size $m$ ($1 \le m \le M_n$), the ABC criterion proposed in Yang (1999) is

$$\mathrm{ABC}(J_m) = \sum_{i=1}^n (Y_i - \hat Y_{i,J_m})^2 + 2 r_{J_m} \sigma^2 + \lambda \sigma^2 C_{J_m},$$

where $\lambda$ is a pure constant, $r_{J_m}$ is the rank of the design matrix of $J_m$, and $C_{J_m}$ is the model index descriptive complexity. Let $r_{M_n}$ denote the rank of the full model $J_{M_n}$, which is assumed to be at least 1.

Let $\bar J$ denote the model that gives the full projection matrix $I_{n \times n}$ (since the ASE at the design points is the loss of interest, this identity projection is permitted). We define $\mathrm{ABC}(\bar J) = 2n\sigma^2 + \lambda \sigma^2 C_{\bar J}$. Let $J_0$ denote the null model that only includes the intercept and define $\mathrm{ABC}(J_0) = \sum_{i=1}^n (Y_i - \bar Y)^2 + 2\sigma^2 + \lambda \sigma^2 C_{J_0}$, where $\bar Y = \sum_{i=1}^n Y_i / n$. The model index descriptive complexity $C_J$ satisfies $C_J > 0$ and $\sum_J e^{-C_J} \le 1$, where the summation is over all the candidate models being considered.
considered. 

The subset models of size $1 \le m \le M_n \wedge n$, the models $J_0$ and $\bar J$ are considered with the complexity $C_{J_m} = -\log 0.85 + \log((M_n - 1) \wedge n) + \log\binom{M_n}{m}$ for a subset model with $m < M_n$, $C_{J_{M_n}} = -\log 0.05$ for the full model $J_{M_n}$, $C_{J_0} = -\log 0.05$ for the null model $J_0$, and $C_{\bar J} = -\log 0.05$ for the full projection model $\bar J$. Note that for the purpose of estimating $f_0^n$, there is no problem with duplication in the list of candidate models.
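One can verify that these complexities satisfy the requirement $\sum_J e^{-C_J} \le 1$. The sketch below does so with exact rational arithmetic, under our reading that the subset models have sizes $1 \le m \le (M_n - 1) \wedge n$ and that the full, null, and full projection models each receive mass 0.05; the function names are ours.

```python
import math
from fractions import Fraction


def subset_complexity(m: int, M_n: int, n: int) -> float:
    """C_J for a subset model of size m < M_n."""
    cap = min(M_n - 1, n)
    return -math.log(0.85) + math.log(cap) + math.log(math.comb(M_n, m))


def kraft_sum(M_n: int, n: int) -> Fraction:
    """Sum of exp(-C_J) over all candidate models, computed exactly.

    exp(-C_J) = 0.85 / (cap * C(M_n, m)) for each subset model of size m,
    and 0.05 for each of the full, null and full projection models.
    """
    if M_n < 2:
        raise ValueError("need at least two predictors")
    cap = min(M_n - 1, n)
    subset_mass = sum(
        math.comb(M_n, m) * Fraction(85, 100) / (cap * math.comb(M_n, m))
        for m in range(1, cap + 1)
    )
    special_mass = 3 * Fraction(5, 100)
    return subset_mass + special_mass


if __name__ == "__main__":
    print(kraft_sum(10, 6), kraft_sum(5, 100))
    print(subset_complexity(3, 10, 6))
```

Under this reading the masses add up to exactly one: 0.85 is spread evenly over the model sizes and then over the models of each size, and the remaining 0.15 is shared by the three special models.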

Let $\Gamma_n$ denote the set of all the models considered, and the model chosen by the ABC criterion is

$$\hat J = \arg\min_{J \in \Gamma_n} \mathrm{ABC}(J).$$

The ABC estimator $\hat f_{\hat J}$ is the fitted value $\hat Y_{\hat J}$. Let $f_J = P_J f_0^n$ be the projection of $f_0^n$ into the column space of the design matrix of model $J$.

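For ease of presentation, define $\Phi^{\mathcal{F}}$ as follows:

$$\Phi^{\mathcal{F}_q(t_n; M_n)} = \begin{cases} \dfrac{\sigma^2 r_{M_n}}{n} & \text{if } m_* = M_n \wedge n, \\ \sigma^{2-q} t_n^q \left(\dfrac{1 + \log\frac{M_n}{(n t_n^2 \tau)^{q/2}}}{n}\right)^{1-q/2} \wedge \dfrac{\sigma^2 r_{M_n}}{n} & \text{if } 1 < m_* < M_n \wedge n, \\ \left(t_n^2 \vee \dfrac{\sigma^2}{n}\right) \wedge \dfrac{\sigma^2 r_{M_n}}{n} & \text{if } m_* = 1. \end{cases}$$

In addition, for lower bound results, let $\underline{\Phi}^{\mathcal{F}_q(t_n; M_n)}$ ($0 < q \le 1$) and $\underline{\Phi}^{\mathcal{F}_q(t_n; M_n) \cap \mathcal{F}_0(k_n; M_n)}$ ($0 < q \le 1$) be the same as $\Phi^{\mathcal{F}_q(t_n; M_n)}$ and $\Phi^{\mathcal{F}_q(t_n; M_n) \cap \mathcal{F}_0(k_n; M_n)}$, respectively, except that when $0 < q \le 1$ and $m_* = 1$, $\underline{\Phi}^{\mathcal{F}_q(t_n; M_n)}$ takes the value $t_n^2 \wedge \frac{\sigma^2 r_{M_n}}{n}$ instead of $\left(t_n^2 \vee \frac{\sigma^2}{n}\right) \wedge \frac{\sigma^2 r_{M_n}}{n}$ and $\underline{\Phi}^{\mathcal{F}_q(t_n; M_n) \cap \mathcal{F}_0(k_n; M_n)}$ is modified the same way. In the fixed design case, the ranks of the design matrices are certainly relevant in risk bounds (see, e.g., [65, 54]).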

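Theorem 7. When $\lambda > 5.1 \log 2$, the ABC estimator $\hat f_{\hat J}$ simultaneously has the following properties.

(i) For $\mathcal{F} = \mathcal{F}_q(t_n; M_n)$ with $0 < q \le 1$, or $\mathcal{F} = \mathcal{F}_0(k_n; M_n)$ with $1 \le k_n \le M_n$, or $\mathcal{F} = \mathcal{F}_q(t_n; M_n) \cap \mathcal{F}_0(k_n; M_n)$ with $0 < q \le 1$ and $1 \le k_n \le M_n$, we have

$$\sup_{f_0 \in \mathcal{F}} E(\mathrm{ASE}(\hat f_{\hat J})) \le B\, \Phi^{\mathcal{F}},$$

where the constant $B$ depends only on $q$ and $\lambda$ for the first and third cases of $\mathcal{F}$, and depends only on $\lambda$ for the second case.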
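(ii) In general, for an arbitrary $f_0^n$, we have

$$E(\mathrm{ASE}(\hat f_{\hat J})) \le B \left( \Big( \|f_{J_{M_n}} - f_0^n\|_n^2 + \frac{\sigma^2 r_{M_n}}{n} \Big) \wedge \inf_{J_m : 1 \le m < M_n} \Big( \|f_{J_m} - f_0^n\|_n^2 + \frac{\sigma^2 r_{J_m}}{n} + \frac{\sigma^2 \log(M_n \wedge n)}{n} + \frac{\sigma^2 \log\binom{M_n}{m}}{n} \Big) \wedge \Big( \|f_{J_0} - f_0^n\|_n^2 + \frac{\sigma^2}{n} \Big) \wedge \sigma^2 \right),$$

where the constant $B$ depends only on $\lambda$.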

Remark 12. In (i), the case $\mathcal{F} = \mathcal{F}_0(k_n; M_n)$ does not require $\max_{1 \le j \le M_n} \sum_{i=1}^n x_{i,j}^2 / n \le 1$.

Remark 13. In pursuing the best performance in each case of $\mathcal{F}$, the general risk bound in (ii) reduces to $B\,\Phi^{\mathcal{F}}$ plus the approximation error $d_n^2(f_0; \mathcal{F}) = \inf_{f \in \mathcal{F}} \|f - f_0\|_n^2$.

For the lower bound results, as before, additional conditions are needed. Let $\mathbf{X}$ denote the design matrix of the full model $J_{M_n}$.
Assumption SRC: For some $\gamma > 0$, there exist two positive constants $\underline{a}$ and $\overline{a}$ that do not depend on $n$ such that for every $\theta$ with $\|\theta\|_0 \le \min(2\gamma, M_n)$, we have

$$\underline{a}\, \|\theta\|_2 \le \frac{1}{\sqrt{n}} \|\mathbf{X}\theta\|_2 \le \overline{a}\, \|\theta\|_2.$$

This condition is slightly weaker than Assumption 2 in [53], which was used to derive minimax lower bounds for $0 < q \le 1$.

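Theorem 8. Suppose the noise $\varepsilon$ follows a normal distribution with mean 0 and variance $0 < \sigma^2 < \infty$. For $\mathcal{F} = \mathcal{F}_q(t_n; M_n)$ with $0 < q \le 1$, or $\mathcal{F} = \mathcal{F}_0(k_n; M_n)$ with $1 \le k_n \le M_n$, or $\mathcal{F} = \mathcal{F}_q(t_n; M_n) \cap \mathcal{F}_0(k_n; M_n)$ with $0 < q \le 1$ and $1 \le k_n \le M_n$, under Assumption SRC with $\gamma = m^*$, or $k_n$, or $k_n \wedge m^*$ respectively, we have

$$\inf_{\hat f} \sup_{f_0 \in \mathcal{F}} E(\mathrm{ASE}(\hat f)) \ge B'\, \underline{\Phi}^{\mathcal{F}},$$

where the infimum is over all estimators $\hat f$, and the constant $B'$ depends only on $\underline{a}$ and $\overline{a}$ for the second case of $\mathcal{F}$ and additionally on $q$ for the first and third cases of $\mathcal{F}$.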

Remark 14. If SRC is not satisfied on the set of all the predictors but is satisfied on a subset of Mo 
predictors, as long as log log and log ^^^"^ are of the same order as log \ogj^, and 
log jj^'^^. , respectively, we get the same risk lower rates. When M„ is really large, this relaxation 
of SRC can be much less stringent for application. 

For the case $q = 0$, the achievability of the upper rate is a direct consequence of [65]. The lower
rates for $q = 0$ and/or 1 are given in [54], where the satisfiability of the SRC is also worked out.
Raskutti et al. [53], under the assumption that the rank of the full design matrix is $n$, derived the
minimax rates of convergence $t_n^q (\log(M_n)/n)^{1-q/2}$ for $0 \le q \le 1$ in an in-probability sense for linear
regression with fixed design with the $\ell_q$-constraint when $M_n \gg n$ and $M_n/(t_n^q n^{q/2}) \ge M_n^{\kappa}$ with
some $\kappa \in (0, 1)$. From our result, the ABC estimator simultaneously achieves the minimax rates of
convergence for all $0 \le q \le 1$ and for all $M_n \ge 2$ and $t_n$ no smaller than order $n^{-1/2}$, and also
under the joint constraints when $q = 0$ and $0 < q \le 1$. We also need to point out that we only work
on estimating the regression mean in this work, but [53] showed that, under additional conditions,
these upper rates are also valid for the estimation of the parameter $\theta$ under the squared error and
verified their minimaxity. Concurrent work by Ye and Zhang [73] also derived performance bounds
on the coefficient estimation that are optimal in a sense of uniformity over the different designs.

In application, the assumption that $f_0 \in \mathcal{F}_q(t_n; M_n)$ or $f_0 \in \mathcal{F}_0(k_n; M_n)$ may sometimes be
too strong to be appropriate. Thus, risk bounds that permit model mis-specification, i.e.,
$f_0 \notin \mathcal{F}_q(t_n; M_n)$, are desirable. Part (ii) in the upper bound theorem (Theorem 7) shows that the ABC
estimator handles model mis-specification. Indeed, for the different $\ell_q$-constraints, the risk of the
ABC estimator is upper bounded by a multiple of $d_n^2(f_0; \mathcal{F}_q(t_n; M_n))$ plus the earlier upper bounds,
respectively. Therefore, model mis-specification or not, our estimator is minimax rate adaptive over
the $\ell_q$-hulls without any knowledge about the values of $q$, $t_n$ and $k_n$ (as long as $t_n$ is not trivially
small).

One limitation of this result, from a theoretical point of view, is that the factor in front of
$d_n^2(f_0; \mathcal{F})$ is larger than one. When the initial estimates need to be obtained based on the same
data available, the multiplying factor being one no longer necessarily has any essential advantage.
However, striving for the right constant is theoretically attractive when the elements in the
dictionary are observed or truly provided by others.

In that direction, recently, Rigollet and Tsybakov [54], by considering an estimator based on
the mixing-least-square-estimators algorithm of Leung and Barron [46] with some specific choice of
prior probabilities on the models, have provided in-expectation optimal upper bounds for $\ell_0$- and/or
$\ell_1$-aggregation. With the power of the oracle inequality (or the index of resolvability bound), their
estimator is shown to be adaptive over $\ell_0$- and $\ell_1$-hulls. Their results do not address $\ell_q$-aggregation
for $0 < q < 1$. We next show that we can have an estimator that handles all $0 \le q \le 1$ in generality.

The mixed least squares estimator by the mixing algorithm of Leung and Barron (2006) is given by

$$\hat f^{MLS} = \sum_{J \in \Gamma_n} w_J \hat Y_J \quad \text{with} \quad w_J = \frac{\pi_J \exp\{-R_J/(4\sigma^2)\}}{\sum_{J' \in \Gamma_n} \pi_{J'} \exp\{-R_{J'}/(4\sigma^2)\}},$$

where $R_J = n\|Y - \hat Y_J\|_n^2 + 2 r_J \sigma^2 - n\sigma^2$ is the unbiased risk estimate for $\hat Y_J$. Let the prior on
model $J$ be chosen as $\pi_{J_m} = 0.85 \left( ((M_n - 1) \wedge n) \binom{M_n}{m} \right)^{-1}$ for $1 \le m \le (M_n - 1) \wedge n$, and
$\pi_{J_{M_n}} = \pi_{J_0} = \pi_{\bar J} = 0.05$.
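
The weighting step can be sketched numerically. The following is a minimal illustration, assuming the weights are proportional to the prior times exp(-R_J/(4 sigma^2)) with R_J the unbiased risk estimate described above; the function names `mls_weights` and `mix_fits` and the toy inputs are hypothetical, and the least squares fits themselves are taken as given rather than computed.

```python
import math

def mls_weights(rss, ranks, priors, n, sigma2):
    """Exponential weights w_J proportional to pi_J * exp(-R_J / (4 sigma^2)).

    rss[j]    : residual sum of squares of model j, i.e. n * ||Y - Yhat_J||_n^2
    ranks[j]  : rank r_J of model j
    priors[j] : prior probability pi_J
    """
    # Unbiased risk estimate R_J = n||Y - Yhat_J||_n^2 + 2 r_J sigma^2 - n sigma^2.
    risks = [s + 2.0 * r * sigma2 - n * sigma2 for s, r in zip(rss, ranks)]
    # Work on the log scale and subtract the maximum for numerical stability.
    logw = [math.log(p) - R / (4.0 * sigma2) for p, R in zip(priors, risks)]
    top = max(logw)
    unnorm = [math.exp(v - top) for v in logw]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def mix_fits(weights, fits):
    """Mixed fit sum_J w_J * Yhat_J; each fit is a list of n fitted values."""
    n = len(fits[0])
    return [sum(w * f[i] for w, f in zip(weights, fits)) for i in range(n)]

# Toy illustration with three hypothetical candidate models.
w = mls_weights(rss=[12.0, 9.5, 9.0], ranks=[0, 1, 3],
                priors=[0.05, 0.475, 0.475], n=10, sigma2=1.0)
```

Subtracting the largest log-weight before exponentiating does not change the normalized weights but avoids underflow when the risk estimates differ by many multiples of sigma^2.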

Theorem 9. Suppose $0 < \sigma < \infty$ is known. For any $M_n \ge 1$, $n \ge 1$, the estimator $\hat f^{MLS}$
simultaneously has the following properties.

(i) For any $0 < q \le 1$, $t_n > 0$,



n ' 



1-9/2 



if = Mn A n, 



if I < < M„ A n, 



and 



EiASEif'^'^S)) < (^d^,(/o;-F,(t„;M„)) + 

A( ||/,7o-/?ll^.+ 



if 7Tl* = 1 



(ii) For $1 \le k_n \le M_n$,



E{ASE{f'''^^)) < diifo;To{kn;Mn)) + B2 



(iii) For any $0 < q \le 1$, $t_n > 0$, and $1 \le k_n \le M_n$,

EiASEiP'"^^)) < dlifo; Fq{tn; Mn) n J-o(fc„; Af„)) 



+^3 < 



1-9/2 



if 771^ > kn, 



if 1 < 777* < A:„ 



and 



E{ASE{f'"''^)) < 4(/o;J-,(t„;A./„)n J-o(fc„;M„)) + 



A Who- ml + ^ 



, if m^, = \ 



(iv) For every $f_0$, we have



E{ASE{f^'^^)) < BiU 
For these cases, the constants $B_1$, $B_2$, $B_3$ and $B_4$ are pure constants, and $B_1$ and $B_3$ depend on $q$.

Remark 15. From (ii) above, by taking $k_n = 1$, we have

$$E(ASE(\hat f^{MLS})) \le \inf_{1 \le j \le M_n} \|f_j - f_0\|_n^2 + B_2 \left( \frac{\sigma^2 (1 + \log M_n)}{n} \wedge \frac{\sigma^2 r_{M_n}}{n} \right),$$

where $f_j = (x_{1j}, \ldots, x_{nj})'$. Thus, we have achieved aggregation for adaptation as well under the
fixed design.

The risk upper bounds above when $q$ is restricted to be either 0 or 1 or under both constraints
are already given in Theorem 6.1 of [54]. The first four cases given there are clearly reproduced
here (note that their cases 3 and 1 are just a special case and an immediate consequence, respectively,
of their case 4, given in our bound in (ii)). Their case 5, a sparse aggregation with $k_n$ estimates
as studied in [69] (page 36) and [49] (called D-convex aggregation), is implied by our bound in (iii)
with q taken to be 1. In the case g = 1, a minor difference is that if — /(^Hjj happens to be 

of a smaller order than <„ — A then our risk bound in (iii) yields a faster 

rate of convergence. In addition, our inclusion of the full projection model among the candidates
guarantees that the risk of our estimator is always bounded, which is not true for the estimator
in [54]. Our main contribution here is to handle adaptive $\ell_q$-aggregation for the whole range of $q$
between 0 and 1. Note that the upper bounds in the above theorem have already been shown to be
minimax-rate optimal under the conditions in Theorem 8.



5.2. A comment on the model selection and model mixing approaches 

From the risk bounds in the previous subsection, we see that the model mixing approach leads to the
optimal constant 1 in front of the approximation error $d_n^2(f_0; \mathcal{F})$ for the three choices of $\mathcal{F}$, which
is not the case for the model selection based estimator. However, the model selection approach may
also have its own advantages.

From the proof of Theorem 7 and the proof of Theorem 1 in [65], besides the given risk bounds,
we also have a general in-probability bound of the form: for any $x > 0$, there are constants $c$, $c'$
(absolute constants) and $c''$ (depending on $\lambda$ and $\sigma^2$) such that

/ ASE{f^j) 



Rnifo) 



— > c + x \ < c cxp ^— c 



where i?„(/o) = inf jgr„ [\\fj ^ /o ||^ + + ^"^ri"'' ) is an index of resolvabihty, which speciahzes 
to the upper bounds in (i) and (ii) of Theorem 7, respectively in those situations. Thus, we know 
that not only ASE(f j) is at order Rnifo) with upper deviation probability exponentially small (in 
x), but also the complexity of the selected model, — is upper bounded in probability in the 
same way as well. In particular, for estimating a linear regression function with the soft or hard (or 
both) constraint(s) on the coefficients, the ABC estimator converges at rate ^ A 

both in expectation and with upper deviation probability exponentially small, where is the 
corresponding effective model size in each case. Furthermore, the rank (the actual number of free- 
parameters) of the model selected by ABC is right at order A rj\/^ with exception probability 
exponentially small. 

For model mixing estimators based on exponential weighting, however, to our knowledge, no 
result has shown that their losses are generally at the optimal rate in probability. In fact, a neg- 
ative result is given in [2] that shows that an exponential weighting based estimator optimal for 
aggregation for adaptation (i.e., its risk regret, or the expected excessive loss, is of order ^ jg 

necessarily sub-optimal in probability (with a non- vanishing probability its excessive loss is at least 



at the much larger order of y ) in certain settings. 

Thus, we tend to believe that both the model selection and model mixing approaches have their 
own theoretical strengths in different ways. 



5.3. When $\sigma$ is unknown

Needless to say, the assumption that $\sigma$ is fully known is unrealistic. When $\sigma$ is unknown but is
upper bounded by a known constant $\bar\sigma > 0$, similar results for rate of convergence can be obtained
with a model selection rule different from ABC.
For this situation, Yang [65] proposed the ABC' criterion:



ABC'{J^) = 




which is a modification of Akaike's FPE criterion [1]. We define ABC {J) = (1 + 2n)Xa^Cj and 
ABC'{Jo) = (l + (Er=i(^* - '^f + ^^^C'jo) . The list of candidate models and complexity 

assignments need to be different for the different situations, as described below. 

1. When $M_n \le n/2$, all the subset models, $J_0$ and $\bar J$ are considered with the complexity
$C_{J_m} = -\log 0.85 + \log(M_n - 1) + \log \binom{M_n}{m}$ for a subset model with $m < M_n$, and
$C_{J_{M_n}} = C_{J_0} = C_{\bar J} = -\log 0.05$.

2. When $M_n > n/2$ and $r_{M_n} > n/2$, we only consider models with size $m \le n/2$, the model $J_0$
and the model $\bar J$. Then we assign the complexity $C_{J_m} = -\log 0.8 + \log(\lfloor n/2 \rfloor) + \log \binom{M_n}{m}$ for
a subset model, and $C_{J_0} = C_{\bar J} = -\log 0.1$.

3. When $M_n > n/2$ and $r_{M_n} \le n/2$, we only consider models with size $m \le n/2$, the full model
$J_{M_n}$, the null model $J_0$, and the model $\bar J$. We assign the complexity $C_{J_m} = -\log 0.85 +
\log(\lfloor n/2 \rfloor) + \log \binom{M_n}{m}$ for a subset model, and $C_{J_{M_n}} = C_{J_0} = C_{\bar J} = -\log 0.05$.
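
A quick way to sanity-check these complexity assignments is to verify that the implied prior masses exp(-C_J) add up to one over all candidate models in each case. The sketch below does this; it is an illustration only, the function names are hypothetical, the handling of the boundary cases (equality with n/2) is an assumption, and it presumes M_n >= 2 and n >= 2 so that every logarithm is defined.

```python
import math

def abc_prime_complexities(M_n, n, r_full):
    """Return (per_size, special) complexities for the three cases.

    per_size[m] is the complexity of any single subset model of size m;
    special holds the complexities of the remaining named models.
    Assumes M_n >= 2 and n >= 2.
    """
    half = n // 2  # floor(n / 2)
    if M_n <= n / 2:                      # case 1: all proper subsets
        sizes = range(1, M_n)
        base = -math.log(0.85) + math.log(M_n - 1)
        special = {"full": -math.log(0.05), "null": -math.log(0.05),
                   "jbar": -math.log(0.05)}
    elif r_full > n / 2:                  # case 2: full model is left out
        sizes = range(1, half + 1)
        base = -math.log(0.8) + math.log(half)
        special = {"null": -math.log(0.1), "jbar": -math.log(0.1)}
    else:                                 # case 3: low-rank full model kept
        sizes = range(1, half + 1)
        base = -math.log(0.85) + math.log(half)
        special = {"full": -math.log(0.05), "null": -math.log(0.05),
                   "jbar": -math.log(0.05)}
    per_size = {m: base + math.log(math.comb(M_n, m)) for m in sizes}
    return per_size, special

def total_prior_mass(M_n, n, r_full):
    """Sum of exp(-C_J) over every candidate model."""
    per_size, special = abc_prime_complexities(M_n, n, r_full)
    # There are comb(M_n, m) subset models of size m, all sharing one complexity.
    mass = sum(math.comb(M_n, m) * math.exp(-c) for m, c in per_size.items())
    return mass + sum(math.exp(-c) for c in special.values())
```

The log-binomial term spreads the mass reserved for size m uniformly over all subsets of that size, which is why the total does not depend on M_n beyond the choice of case.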

In any of the cases above, let $\Gamma_n$ denote the set of all the models considered. The model chosen
by the ABC' is

$$\hat J' = \arg\min_{J \in \Gamma_n} ABC'(J),$$

producing the ABC' estimator $\hat f_{\hat J'} = \hat Y_{\hat J'}$.

Theorem 10. When $\lambda \ge 40 \log 2$, the ABC' estimator $\hat f_{\hat J'}$ simultaneously has the following
properties.

(i) For $\mathcal{F} = \mathcal{F}_q(t_n; M_n)$ with $0 < q \le 1$, or $\mathcal{F} = \mathcal{F}_0(k_n; M_n)$ with $1 \le k_n \le M_n$, or
$\mathcal{F} = \mathcal{F}_q(t_n; M_n) \cap \mathcal{F}_0(k_n; M_n)$ with $0 < q \le 1$ and $1 \le k_n \le M_n$, we have

$$\sup_{f_0 \in \mathcal{F}} E(ASE(\hat f_{\hat J'})) \le B \Phi^{\mathcal{F}},$$

where the constant $B$ depends only on $q$, $\lambda$, $\bar\sigma$, $\sigma$ for the first and third cases of $\mathcal{F}$, and depends
only on $\lambda$, $\bar\sigma$, $\sigma$ for the second case.

(ii) In general, for an arbitrary $f_0$, we have

EiASEifj,)) 

where the constant $B$ depends only on $\lambda$, $\bar\sigma$, $\sigma$.

Remark 16. For the results in (i), as seen before, when $f_0$ is not in the respective class of linear
combinations, an obvious modification is needed by adding a multiple of the approximation error
$d_n^2(f_0; \mathcal{F})$ in the risk bound.

When $0 < \sigma < \infty$ is fully unknown, a model selection method by Baraud, Giraud and Huet [7]
can be used to obtain results on $\ell_q$-regression.

They consider a different modification of the FPE criterion [1]: 

where $pen(J_m)$ is a penalty assigned to the model $J_m$. They devise a new form for $pen(J_m)$ (Section
4.1 in [7]) to yield a nice oracle inequality (Corollary 1) that does not require any knowledge of $\sigma$,
but at the expense of excluding some large models from consideration. When $M_n \le (n - 7) \wedge \varsigma n$
for some $0 < \varsigma < 1$, we consider all subset models in the model selection process. When $M_n$ is large,
we consider only subset models with $n - r_{J_m} \ge 7$ and $m \vee \log \binom{M_n}{m} \le \varsigma n$ for a fixed $0 < \varsigma < 1$.
Combining the tools developed in this and their papers, we have the following result.

Theorem 11. The BGH estimator $\hat f_{\hat J}$ has the following properties.

(i) When $M_n \le (n - 7) \wedge \varsigma n$, for $\mathcal{F} = \mathcal{F}_q(t_n; M_n)$ with $0 < q \le 1$, or $\mathcal{F} = \mathcal{F}_0(k_n; M_n)$ with
$1 \le k_n \le M_n$, or $\mathcal{F} = \mathcal{F}_q(t_n; M_n) \cap \mathcal{F}_0(k_n; M_n)$ with $0 < q \le 1$ and $1 \le k_n \le M_n$, we have

$$\sup_{f_0 \in \mathcal{F}} E(ASE(\hat f_{\hat J})) \le B \Phi^{\mathcal{F}},$$

where the constant $B$ depends only on $q$ and $\varsigma$ for the first and third cases of $\mathcal{F}$, and depends
on $\varsigma$ for the second case.

(ii) For a general $M_n$, if $m_*$ satisfies $m_* \le n - 7$ and $m_* \vee \log \binom{M_n}{m_*} \le \varsigma n$, we have



sup E{ASE{f j)) < B 

/o6:^,(t„;M„) 



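where $B$ depends only on $q$ and $\varsigma$. If $k_n$ satisfies $k_n \le n - 7$ and $k_n \vee \log \binom{M_n}{k_n} \le \varsigma n$, we have

$$\sup_{f_0 \in \mathcal{F}_0(k_n; M_n)} E(ASE(\hat f_{\hat J})) \le B \frac{\sigma^2 k_n \left( 1 + \log(M_n/k_n) \right)}{n},$$

where $B$ is a constant that depends only on $\varsigma$.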

Remark 17. As before, when $f_0$ is not in the respective class, a multiple of the approximation error
$d_n^2(f_0; \mathcal{F}) = \inf_{f \in \mathcal{F}} \|f - f_0\|_n^2$ needs to be added in the aggregation risk bound.

From the above theorem, we see that when $\sigma$ is fully unknown, as long as $M_n \le (n - 7) \wedge \varsigma n$ for
some $0 < \varsigma < 1$, similar risk bounds to those in Theorem 10 for $\ell_q$-regression hold. However, when
$M_n$ is larger, the previous risk bounds are seriously compromised: 1) the possible improvement in
risk due to low rank of the full model is no longer guaranteed; 2) the previous upper rates determined
by the effective model size $m_*$ or $k_n$ are valid only when those model sizes are not excluded from
consideration by the BGH criterion; 3) the risk is no longer guaranteed to be always uniformly
bounded. Indeed, due to the restriction on the model sizes to be considered, the final risk here
can be arbitrarily large. It turns out that this last aspect is not due to technical deficiency in the
analysis, but it is a necessary price to pay for not knowing $\sigma$ at all (see [61]).



6. Discussion 



Since the early 1990s, sparse estimation has been recognized as an important tool for multi-dimensional
function estimation. The emergence of high-dimensional statistical problems in the information age
has prompted increasing attention to the topic from theoretical, computational and applied
perspectives. We focus only on a theoretical standpoint in the discussion below.

To our knowledge, several lines of research on sparse function estimation in the 1990s produced
theoretical foundations that still provide an essential understanding of ways to explore sparsity and
the associated price to pay when pursuing sparse estimation from minimax perspectives. It has been
discovered that for some function classes, sparse representations (in contrast to traditional full
approximation) result in faster rates of convergence, which alleviate the curse of dimensionality
when the problem size is large. Such function classes include, for example, Besov classes (e.g., [31])
and Jones-Barron classes ([9, 42]), and may also be defined directly in terms of sparse approximation
(e.g., [71], Section III.D). Regarding methods to achieve the optimal sparse estimation, wavelet
thresholding with one or more orthonormal dictionaries and model selection with a descriptive
complexity penalty term added to the sum of negative maximized likelihood (or a general contrast
function) and a multiple of the model dimension have yielded successful theoretical advancements.
Oracle inequalities/index of resolvability bounds have been derived that readily give minimax-rate
adaptive estimators for various scenarios. In linear representation, $\ell_1$-constraints on the coefficients
have long been known to be associated with fast rates of convergence for both orthogonal and
non-orthogonal bases by model selection or aggregation methods, as mentioned in the introduction of
this paper.

It is worth noticing that these research works usually target nonparametric settings. In the past
few years, the situation of a large number of naturally observed predictors has attracted much
attention, shifting the focus to much simpler linear modeling. As pointed out earlier, the work in
the 1990s on model selection has direct implications for high-dimensional linear regression. For
example, if the sum of the absolute values of the linear coefficients is bounded ($\ell_1$-constraint), then
the rate of convergence is bounded by $(\log n/n)^{1/2}$ as long as $M_n$ increases only polynomially in $n$.
If only $k_n$ terms have non-zero coefficients ($\ell_0$-constraint), then the rate of convergence is of order
$k_n(1 + \log(M_n/k_n))/n$ based on model selection with mild conditions on the predictors. However,
such subset selection based estimators pose computational challenges in real applications.

In the direction of using the $\ell_1$-constraints in constructing estimators, algorithmic and theoretical
results have been well developed. Both the Lasso and the Dantzig selector have been shown to
achieve the rate $k_n \log(M_n)/n$ under different conditions on correlations of predictors and the hard
sparsity constraint on the linear coefficients (see [34] for a discussion about the sufficient conditions
for deriving oracle inequalities for the Lasso). Our upper bound results do not require any of those
conditions, but we do assume the sparse Riesz condition for deriving the lower bounds. Computational
issues aside, we have seen that the approach of model selection/combination with descriptive
complexity penalty has provided the most general adaptive estimators that automatically exploit
the sparsity nature of the target function in terms of linear approximations subject to $\ell_q$-constraints.

Donoho and Johnstone [30] derived insightful general asymptotic minimax risk expressions for
estimating the mean vector in $\ell_q$-balls ($0 < q < \infty$) under $\ell_p$ loss ($p \ge 1$) in a Gaussian sequence
framework. The work by Raskutti et al. [53] and by Rigollet and Tsybakov [54] are directly related
to our work in the fixed design case. The former successfully obtains optimal non-adaptive
in-probability loss bounds for their main scenario that $M_n$ is much larger than $n$ for general $0 \le q \le 1$
when the true regression function is assumed to be in the $\ell_q$-hull. In contrast, our estimators are
adaptive and the risk bounds hold without restrictions on $M_n$ or the "norm" parameter $t_n$, also
allowing the true regression function to be really arbitrary. The work of Rigollet and Tsybakov
[54] nicely shows the adaptive aggregation capability of model mixing over $\ell_0$- and $\ell_1$-balls. Our
results are valid over the whole range of $0 \le q \le 1$. For lower bounds, our formulation is somewhat
different from theirs. In addition, unlike those results, we have also provided results when the error
variance is unknown but upper bounded by a known constant or fully unknown. Furthermore, our
model selection based estimators have optimal convergence rates also in terms of upper deviation
probability, which may not hold for the model mixing estimators. We need to point out that both
[53] and [54] have given results on related problems that we do not address in this work.

In our results, the effective model size $m_*$ (as defined in Section 2.5) plays a key role in determining
the minimax rate of $\ell_q$-aggregation for $0 < q \le 1$. With the extended definition of the effective
model size $m_*$ to be simply the number of nonzero components $k_n$ when $q = 0$ and re-defining $m_*$
to be $m_* \wedge k_n$ under both $\ell_q$- ($0 < q \le 1$) and $\ell_0$-constraints, the minimax rate of aggregation is
unified to be the simple form $1 \wedge \frac{m_* \left( 1 + \log(M_n/m_*) \right)}{n}$.
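
To get a feel for the unified form, the sketch below evaluates it numerically. It reads the rate as 1 ∧ m_*(1 + log(M_n/m_*))/n, which is an assumption where the extracted formula is damaged, takes the effective model size m_* as a given input (its definition is in Section 2.5), and ignores constants and the noise level; the function name `unified_rate` is hypothetical.

```python
import math

def unified_rate(m_star, M_n, n):
    """Evaluate min(1, m_* (1 + log(M_n / m_*)) / n) for 1 <= m_* <= M_n."""
    return min(1.0, m_star * (1.0 + math.log(M_n / m_star)) / n)

# A sparse regime: 5 effective terms out of 1000 candidates.
for n in (100, 1000, 10000):
    print(n, unified_rate(5, 1000, n))
```

The truncation at one reflects that the rate is never worse than a bounded constant, while the logarithmic factor is the price of searching over subsets of size m_*.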

Risk bounds for selection/mixing least squares estimators from a countable collection of linear
models (such as given in [65, 46]), together with sparse approximation error bounds, are essential
for our approach to devise minimax optimal sparse estimation for fixed design. When the predictors
are taken as some initial estimates, the selection/mixing methods can be regarded as aggregation
methods with the risk bounds as aggregation risk bounds. In a strict sense, however, these results
are not totally satisfactory for at least two reasons. First, the evaluation of performance only at
the design points that have been seen already has limited value: i) The strengths of the candidate
procedures may not be reflected at all on such a measure; ii) A small ASE on the design points
does not mean good behavior on future predictor values. Second, when the initial estimates are
not given (which is almost always the case), to combine arbitrary estimators, data splitting is
typically necessary to come up with the candidate estimates and use the rest of the sample for
weight assignment. Then, the final risk bounds, unfortunately, depend on how the data are split.
In contrast, for the random design case, this is not an issue. We have also seen that because ASE
cares only about the performance at the design points, given the i.i.d. normal error assumption,
there is absolutely no condition needed on the true regression function, as pointed out in a remark
to Theorem 1 in [65]. For random design, however, we have made the sup-norm bound assumption,
but the risk bounds guarantee optimal future performance as long as the sampling distribution is
unchanged.

Regarding aggregation, we notice that the $\ell_q$-aggregation includes as special cases the state-of-art
aggregation problems, namely aggregation for adaptation, convex and D-convex aggregations, linear
aggregation, and subset selection aggregation, and all of them can be defined (or essentially so)
by considering linear combinations under $\ell_0$- and/or $\ell_1$-constraints. Our investigation provides
optimal rates of aggregation, which not only agree with (and, in some cases, improve over) previous
findings for the mostly studied aggregation problems, but also hold for a much larger set of linear
combination classes. Indeed, we have seen that $\ell_0$-aggregation includes aggregation for adaptation
over the initial estimates (or model selection aggregation) ($\ell_0(1)$-aggregation), linear aggregation
when $M_n \le n$ ($\ell_0(M_n)$-aggregation), and aggregation to achieve the best performance of linear
combination of $k_n$ estimates in the dictionary for $1 < k_n < M_n$ (sometimes called subset selection
aggregation) ($\ell_0(k_n)$-aggregation). When $M_n$ is large, aggregating a subset of the dictionary under
an $\ell_q$-constraint for $0 < q \le 1$ can be advantageous, which is just $\ell_0(k_n) \cap \ell_q(t_n)$-aggregation. Since
the optimal rates of aggregation as defined in [58] can differ substantially in different directions of
aggregation and typically one does not know which direction works the best for the unknown regression
function, multi-directional or universal aggregation is important so that the final estimator is
automatically conservative or aggressive, whichever is better (see [69]). Our aggregation strategy
is indeed multi-directional, achieving the optimal rates over all $\ell_q$-aggregation for $0 \le q \le 1$ and
$\ell_0 \cap \ell_q$-aggregation for all $0 < q \le 1$.

One interesting observation is that aggregation for adaptation is essentially a special case of
$\ell_q$-aggregation, yet our way of achieving the simultaneous $\ell_q$-aggregation is by methods of aggregation
for adaptation through model selection/combination.

Aggregation of estimates and regression estimation problems are closely related. For aggregation,
besides the fact that the predictors to be aggregated are from some initial estimations (and thus are
not directly observed), the emphases are: i) One is unwilling to make assumptions on relationships
between the initial estimates so that they can have arbitrary dependence; ii) One is unwilling to
make specific assumptions on the true regression function beyond that it is uniformly bounded and
hence allows model mis-specification. In this game, there is little interest in the true or optimal
coefficients in the representation of the regression function in terms of the initial estimates.

Obviously, there are other directions of aggregation that one may pursue. The $\ell_q$-aggregation
strategy that relies on aggregating subset choices of the initial estimates, as in [69], while producing
the most general aggregation risk bounds so far, follows a global aggregation paradigm, i.e., the
linear coefficients are globally determined. It is conceivable that sometimes localized weights may
provide better estimation/prediction performance (see, e.g., [70]). Much more work is needed here
to result in practically effective localized aggregation methods.

Aggregation of estimates, as an important step in combining statistical procedures, has proven to
bring theoretically elegant and practically feasible methods for regression estimation/prediction. It
is an important vehicle to share strengths of different function estimation methodologies to produce
adaptively optimal and robust estimators that work well under minimal conditions. Aggregation by
mixing certainly cannot replace model selection when selection of an estimator among candidates
or a set of predictors is essential for interpretation or business/operational decisions.

Our focus in this work is of a theoretical nature to provide an understanding of the fundamental
theoretical issues about $\ell_q$-aggregation or linear regression under $\ell_q$-constraints. Computational
aspects will be studied in the future.

7. General oracle inequalities for random design 

Consider the setting in Section 3.2. 

Theorem 12. Suppose ^e-g holds for the E-G strategy, respectively. Then, the following oracle 
inequalities hold for the estimator fp^ ■ 

(i) For T-C and T-Y strategies, 
R{fF„;fo;n) 

■ f ( . f ,2., X- ^^ 1 + log (^")+ log(M„ An) - log(l - pp) 

< Co mf cimfd (/o;J^j„) + C2 h C3 ^ '- 

l<m<A/„Ari \ ./,„ m n — ni I 

/\co II /oil + C3 



n — ni 

where cq = 1, ci = C2 = Cl.ct, C3 = ^ for the T-C strategy; cq = Cy , ci = 0-2 = Clm, C3 = cr^ for 
the T- Y strategy. 

(a) For AC-C and AC-Y strategies, 



RifF„-Jo;n) 



< Co inf \ R{fo,m,n) + C2 h C3 



m 1 + log (^;-) + log(Af„ An) - log(l - po) 



l<m<M„/\n \ m n — ni 

Aco f ||/o||' + C3- 



n — ni 



where 



i?(/o, m, n) = ci inf inf d^(/o; J + 2c3 



,/,„ s>i y ■ n — ni 

and Co = ci = 1, C2 = 8c{a^ + 5L^), C3 = 3^7 for the AC-C strategy; co — Cy, c\ = 1, C2 
8c(ct^ + 5L^), C3 = CT^ /or i/ie AC-Y strategy. 
From the theorem, the risk $R(\hat f_{F_n}; f_0; n)$ is upper bounded by a multiple of the best trade-off of
the different sources of errors (approximation error, estimation error due to estimating the linear
coefficients, and error associated with searching over many models of the same dimension). For a
model $J$, let $IR(f_0; J)$ generically denote the sum of these three sources of errors. Then, the best
trade-off is $IR(f_0) = \inf_J IR(f_0; J)$, where the infimum is over all the candidate models. Following
the terminology in [10], $IR(f_0)$ is the so-called index of resolvability of the true function $f_0$ by the
estimation method over the candidate models. We call $IR(f_0; J)$ the index of resolvability at model
$J$. The utility of the index of resolvability is that for $f_0$ with a given characteristic, an evaluation
of the index of resolvability at the best $J$ immediately tells us how well the unknown function is
"resolved" by the estimation method at the current sample size. Thus, accurate index of resolvability
bounds often readily show minimax optimal performance of the model selection based estimator.
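
The trade-off can be made concrete with a small numerical sketch. It assumes, purely for illustration, that the three error sources at a model of size m take the generic forms approximation error, sigma^2 m/n, and sigma^2 log C(M_n, m)/n, with all multiplying constants set to one; the function name and the approximation errors fed to it are hypothetical and do not come from the theorem.

```python
import math

def index_of_resolvability(approx_err, M_n, n, sigma2=1.0):
    """Minimize approximation + estimation + search error over model sizes.

    approx_err maps a model size m to the (hypothetical) best squared
    approximation error among models of that size. Returns (best_m, value).
    """
    best = None
    for m, a in approx_err.items():
        estimation = sigma2 * m / n                        # fitting m coefficients
        search = sigma2 * math.log(math.comb(M_n, m)) / n  # searching over size-m subsets
        total = a + estimation + search
        if best is None or total < best[1]:
            best = (m, total)
    return best

# Approximation error drops sharply from size 1 to 2, then flattens.
best_m, value = index_of_resolvability({1: 1.0, 2: 0.1, 3: 0.09}, M_n=10, n=100)
```

In this toy input the infimum is attained at size 2: moving to size 3 reduces the approximation error by less than the added estimation and search costs.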

Proof, (i) For the T-C strategy, 

R{fF„;fo;n) 



< ^ inf \cL,.mfd^ifo;TjJ + CL, 



2 / log(M„ A n) + log {^^) - log(l - po 



l<m<M„An )'./,„ ' ™ ' Til Xq \ Tl — Hi 

2 logpo 



A ll/oll^ , 
For the T-Y strategy, 
^(/F„;/o;n) 

< Cy inf \cL.a-^ni<f{h;TjJ + CL,a— + cT 



2 



1 + log(Af„ A n) -f log (^") - log(l - Po) 



l<m<M„f\n ,/„, Til \ n — rii 

ACy(|l/o|P + a2 



2 , 2 1 - logPO 



71 — Til 
(ii) For the AC-C strategy, 
^(/f„; /o;«) 

l<)ri<A/„Ari Jm s>l \ ni Ac \ n — Til 



71 — rii / / J t Ac n — 711 

< inf |infinff.^(/„;^f J + 8c(a^ + 5L^) ^ + A f ^ ") + (^^) ' ^"^(^ " 

l<m<M„Ari 1 ,/,„ s>l \ ™' 111 Ac \ 71-711 



A ll/oll^- 



?1 — ?ll / / J t Ac 71 — Til 

For the AC-Y strategy, 
^(/f„;/o;") 

< Cy inf jinfinf frf^(/o;^,^ J + c(2.- + ^ + / 1 + W A 7i) + log ri;) 

l<m<A/„An 1 J„ s>l \ 71i \ 71 - 71i 

log(l-po) + 21og(l + s)^|^|1 /iifi|2, 2l"logPo 

A Cy ll/oll + cr 



71 — Til / / J I 71 — 711 

< Cy inf jinf inf { d^jfo^T^ J + 8c(.^ + 5L^) ^ + f ^ + ^"^^^^ ^ + ^ ^ 

l<m<i\/„An I J„ s>l \ \ n — Til 

log(l -po) + 21og(l + s) W ) ^ ^ fii^ii2, 2l-logPo 



71 — 711 / / J L 71 — 711 



A Cy ll/oll^ + 



□ 



Remark 18. Similar oracle inequalities hold for the estimator /a under the linear regression set- 
ting with random design: ^^(/oS-^Jm) is replaced by c?^(/o; ^j,„), and X^jeJ ^ifi replaced by 
SjsJ ^i'^'i ^'^^ above theorem. 



8. Proofs 

Proof of Theorem 1. 

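Proof. (i) Because $\{e_j\}$ is an $\epsilon$-net of $\mathcal{F}_q(t_n)$ if and only if $\{t_n^{-1} e_j\}$ is an $\epsilon/t_n$-net of $\mathcal{F}_q(1)$,
we only need to prove the theorem for the case $t_n = 1$. Recall that for any positive integer $k$, the
unit ball of $\ell_q^{M_n}$ can be covered by $2^{k-1}$ balls of radius $\epsilon_k$ in $\ell_1$ distance, where

$$\epsilon_k \le c \begin{cases} 1 & 1 \le k \le \log_2(2M_n), \\ \left( k^{-1} \log_2\left(1 + \frac{2M_n}{k}\right) \right)^{\frac{1}{q} - 1} & \log_2(2M_n) \le k \le 2M_n, \\ 2^{-\frac{k}{2M_n}} (2M_n)^{1 - \frac{1}{q}} & k \ge 2M_n \end{cases}$$

(cf. [32], page 98). Thus, there are $2^{k-1}$ functions $g_j$, $1 \le j \le 2^{k-1}$, such that

$$\mathcal{F}_q(1) \subset \bigcup_{j=1}^{2^{k-1}} \left( g_j + \mathcal{F}_1(\epsilon_k) \right).$$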

For any g G J-"i(efc), g can be expressed a.s g = X]f=i "^i/i with X^fii k^l — ^z^- We define a random 
function J7, such that 

F{U = sign(c,)efc/,) = |c,|/efc, P(C/ = 0) = 1 - ^ |c,|/efc. 

1=1 

Then we have |1?7||2 < fifc a-s. and EC/ = g under the randomness just introduced. Let Ui, U2, Um 
be i.i.d. copies of U, and let y = ^ Sl^i ^i- We have 



E||V^-.9||2 = J-||Var([/)||2 < < 
V m \ m 

In particular, there exists a realization of T^, such that \\V — 5II2 < t-kj^pm. Note that ^ can 
be expressed as t\^mr^(k\j\ + fc2/2 + ••• + ki\i^fM,J, where fci, fc2, ^j\f„ are integers, and 
l^il + 1^2! + ■ ■ • + |fcji/„ I < "Ti- Thus, the total number of different realizations of V is upper bounded 
by (^*^;;+"). Furthermore, ||y||o < m. 
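
The probabilistic (empirical averaging) argument used here can be illustrated by simulation. The sketch below is an illustration under simplifying assumptions that are not in the proof: the dictionary is taken to be orthonormal so that the distance between two linear combinations equals the Euclidean distance between their coefficient vectors, and the names `maurey_sample` and `sq_dist` are hypothetical. It draws U equal to sign(c_i) * eps * f_i with probability |c_i|/eps (and zero otherwise), averages m independent copies, and compares the mean squared error with eps^2/m.

```python
import math
import random

def maurey_sample(coefs, eps, m, rng):
    """Coefficient vector of V = (1/m) * sum of m i.i.d. copies of U."""
    probs = [abs(c) / eps for c in coefs]   # requires sum(|c_i|) <= eps
    out = [0.0] * len(coefs)
    for _ in range(m):
        u = rng.random()
        acc = 0.0
        for j, p in enumerate(probs):
            acc += p
            if u < acc:
                # U picks dictionary element j with the sign of c_j, scaled by eps.
                out[j] += math.copysign(eps, coefs[j]) / m
                break
        # If no element was picked, U = 0 and nothing is added.
    return out

def sq_dist(a, b):
    """Squared Euclidean distance between two coefficient vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

rng = random.Random(0)
coefs, eps, m = [0.3, -0.2, 0.1, 0.0, 0.2], 1.0, 20
errs = [sq_dist(maurey_sample(coefs, eps, m, rng), coefs) for _ in range(2000)]
mean_sq_err = sum(errs) / len(errs)  # should not exceed eps**2 / m on average
```

Each draw of V uses at most m distinct dictionary elements, which is the source of the sparsity bound on the approximating function.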

If log2(2M„) < fc < 2Mn, we choose m to be the largest integer such that < 2^=. Then 

we have 

1 c' / 2M„ 
— < T log2 I 1 



m k \ k 
for some positive constant c'. Hence, ^g(l) can be covered by 2^''^^ balls of radius 



efcWc'fc-ilog2 ( 1 + 



in distance. 
If k > 2Af„, we choose m = Af„. Then Tg{l) can be covered by 2*^-1(2^^^+™) balls of radius 
CkMn in distance. Consequently, there exists a positive constant c" such that -^g(l) can be 
covered by 2'^^ balls of radius r;, where 

1 1 < / < log2(2M„), 

ri<c"{ ;^-|[iog2(l + Mi^)]|-i log2(2M„) < ^ < 2Af„, 
2-2ifc(2Af„)5-i ; > 2M„. 



For any given < e < 1, by choosing the smallest I such that r; < e/2, we find an e/2-net {wijiLi 
of ^q(l) in distance, where 



N = 2'"^ < 



cxp ( c"'e"^ log(l + Afn' ' e) J e > A/i ' , 
exp ( c"'Af„ log(l + A/,r ' e-i) ) e < Afl"', 



and c'" is some positive constant. 

It remains to show that for each 1 < i < A^, we can find a function so that ||e»||o < 5e29/(9-2) + i 
and \\ei — Uj||2 < e/2. 

Suppose u,; = YJ^i ^vfj, ^<i<N, with ^^^1", |c,;,|9 < 1. Let = {j : |cy | > e2/(2-9)}. Then, 
|L,|e29/(2-?) < ^ < 1^ which implies \L,\ < £29/(9-2) ^^^^^ ^^jg^ 

2-2g 

Define Vi = ^jeL '^ijfj ^^'^ ^ "^ji^L-^iifi- have uj^ e J^i(e ^-g ). By the probability 

2-2g 

argument above, we can find a function w[ such that ||wi||o < and — Wi||2 < e ^""^ /V^w- 
particular, if we choose m to be the smallest integer such that m > 4,e'^i/(i~'^) . Then, ||wi — w-||2 < 
e/2. 

We define e; = Vi + w^. we have \\ui — ei||2 < e/2, and then we can show that 

||e,||o = Ik.llo + < l^.l +m< h^<i/(<i-^) + 1. 

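(ii) Let $f_\theta^* = \sum_{j=1}^{M_n} c_j f_j = \arg\inf_{f_\theta \in \mathcal{F}_q(t_n)} \|f_\theta - f_0\|^2$ be the best approximation of $f_0$ over the
class $\mathcal{F}_q(t_n)$. For any $1 \le m \le M_n$, let $L^* = \{j : |c_j| > t_n m^{-1/q}\}$. Because $\sum_{j=1}^{M_n} |c_j|^q \le t_n^q$, we
have $|L^*|\, t_n^q/m < \sum_{j=1}^{M_n} |c_j|^q \le t_n^q$. So, $|L^*| < m$. Also,

$$\sum_{j \notin L^*} |c_j| \le \sum_{j \notin L^*} |c_j|^q \,(t_n m^{-1/q})^{1-q} \le t_n m^{1-1/q} =: D.$$

Define $v^* = \sum_{j \in L^*} c_j f_j$ and $w^* = \sum_{j \notin L^*} c_j f_j$. We have $w^* \in \mathcal{F}_1(D)$. Define a random function
$U$ so that $P(U = D\,\mathrm{sign}(c_j) f_j) = |c_j|/D$, $j \notin L^*$, and $P(U = 0) = 1 - \sum_{j \notin L^*} |c_j|/D$. Thus,
$EU = w^*$, where $E$ denotes expectation with respect to the randomness $P$ (just introduced). Also,
$\|U\| \le D \sup_{1 \le j \le M_n} \|f_j\| \le D$. Let $U_1, U_2, \ldots, U_m$ be i.i.d. copies of $U$, then $\forall x \in \mathcal{X}$,

$$E\left(f_0(x) - v^*(x) - \frac{1}{m}\sum_{i=1}^m U_i(x)\right)^2 = \left(f_\theta^*(x) - f_0(x)\right)^2 + \frac{1}{m}\mathrm{var}(U(x)).$$

Together with Fubini,

$$E\left\|f_0 - v^* - \frac{1}{m}\sum_{i=1}^m U_i\right\|^2 \le \|f_\theta^* - f_0\|^2 + \frac{1}{m}E\|U\|^2 \le \|f_\theta^* - f_0\|^2 + t_n^2 m^{1-2/q}.$$

In particular, there exists a realization of $v^* + \frac{1}{m}\sum_{i=1}^m U_i$, denoted by $f_{\theta^m}$, such that $\|f_{\theta^m} - f_0\|^2 \le \|f_\theta^* - f_0\|^2 + t_n^2 m^{1-2/q}$. Note that $\|f_{\theta^m}\|_0 \le 2m - 1$. If we consider $\tilde{m} = \lfloor (m+1)/2 \rfloor$ instead, we
have $2\tilde{m} - 1 \le m$ and $\tilde{m} \ge m/2$. The conclusion then follows.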
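The sampling argument above can be illustrated numerically. The following is a minimal sketch, not the authors' procedure: it assumes an orthonormal dictionary (so functions are identified with coefficient vectors) and takes $D$ equal to the $\ell_1$ norm of the coefficients being approximated. The function names `maurey_sparse_approximation` and `expected_sq_error` are hypothetical.

```python
import random


def maurey_sparse_approximation(c, m, rng):
    """Average m i.i.d. draws of U, where P(U = D*sign(c_j)*e_j) = |c_j|/D.

    Assumption: the dictionary is orthonormal, so a function is represented
    by its coefficient vector, and D is taken to be the l1 norm of c.
    """
    D = sum(abs(cj) for cj in c)
    approx = [0.0] * len(c)
    if D == 0.0:
        return approx
    weights = [abs(cj) / D for cj in c]
    for _ in range(m):
        j = rng.choices(range(len(c)), weights=weights)[0]
        approx[j] += (D if c[j] > 0 else -D) / m
    return approx


def expected_sq_error(c, m):
    """Exact E||w - (1/m) sum_i U_i||^2 = (E||U||^2 - ||w||^2) / m."""
    D = sum(abs(cj) for cj in c)
    return (D * D - sum(cj * cj for cj in c)) / m
```

The averaged draw has at most $m$ nonzero coefficients, and its expected squared error never exceeds $D^2/m$, which is the bound used in the display above.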



□ 



Proof of Theorem 2. 

Proof. To derive the upper bounds, we only need to examine the index of resolvability for each 
strategy. The natures of the constants in Theorem 2 follow from Theorem 12. 

(i) For T-strategies, according to Theorem 1 and the general oracle inequalities in Theorem 12, 
for each $1 \le m \le M_n \wedge n$, there exists a subset $J_m$ and the best $f_{\theta^m} \in \mathcal{F}_{J_m}$ such that 

j,(f f \ <■ I \\f .,,2^. "1^0 1 + log ( log(^^» An) - log(l - po) \ 

R[jF„;fo;n) < Co ci/e™-/o + 2c2 h 2c3 ^ — 

\ n n J 

Acof||./of + 203!^^°^ 

Under the assumption that $f_0$ has sup-norm bounded, the index of resolvability evaluated at the 
null model $f_\theta = 0$ leads to the fact that the risk is always bounded above by $C_0\left(\|f_0\|^2 + \frac{C_2}{n}\right)$ for 
some constants $C_0, C_2 > 0$. 
For $\mathcal{F} = \mathcal{F}_q(t_n)$, and when $m_* = m^* = M_n < n$, evaluating the index of resolvability at the full 
model $J_{M_n}$, we get 



RifF„;fo;n)<coCid {h]Tq{tn))-\ with 



n n n 

Thus, the upper bound is proved when $m_* = m^* = M_n$. 



For $\mathcal{F} = \mathcal{F}_q(t_n)$, and when $m_* = m^* = n < M_n$, then clearly $m_*\left(1 + \log\frac{M_n}{m_*}\right)/n$ is larger 
than 1, and then the risk bound given in the theorem in this case holds. 

For $\mathcal{F} = \mathcal{F}_q(t_n)$, and when $1 < m_* \le m^* < M_n \wedge n$, for $1 \le m < M_n$, and from Theorem 1, we 
have 

R{fF„;fo;n) < co(cid2(/o;J-^(i„)) + ci22/?-i<>i-2/9 + 2c2^ 

^^^^ l + \ogCt) + \og{MnAn) _ ^^^ log(l-po) \ 
n n I 

Since $\log\binom{M_n}{m} \le m\log\left(\frac{eM_n}{m}\right) = m\left(1 + \log\frac{M_n}{m}\right)$, then 

\ n I 

' / 9 19/ 77l(l+l0g^)\ 



where C and C" are constants that do not depend on n, t„, and Af„ (but may depend on tr^, po 
and L). Choosing rn — m*, we have 



n n 
The upper bound for this case then follows. 
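The bound $\log\binom{M_n}{m} \le m(1 + \log(M_n/m))$ used in this case is easy to check numerically. The snippet below is only a sanity check on small values; the function names are hypothetical and not part of the paper.

```python
import math


def log_binom(M, m):
    """Natural logarithm of the binomial coefficient C(M, m)."""
    return math.log(math.comb(M, m))


def entropy_bound(M, m):
    """The upper bound m * (1 + log(M / m)) on log C(M, m), for 1 <= m <= M."""
    return m * (1.0 + math.log(M / m))
```

For every $1 \le m \le M_n$ the first quantity stays below the second, which is why the model-complexity term can be absorbed into a term of order $m(1 + \log(M_n/m))/n$.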

For $\mathcal{F} = \mathcal{F}_0(k_n)$, by evaluating the index of resolvability from Theorem 12 at $m = k_n$, the upper 
bound immediately follows. 

For $\mathcal{F} = \mathcal{F}_q(t_n) \cap \mathcal{F}_0(k_n)$, both $\ell_q$- and $\ell_0$-constraints are imposed on the coefficients, so the upper 
bound will go with the faster rate from the tighter constraint. The result follows. 

(ii) For AC-strategies, three constraints $\|\theta\|_1 \le s$ ($s > 0$), $\|\theta\|_q \le t_n$ ($0 < q \le 1$, $t_n > 0$) and 
$\|f_\theta\|_\infty \le L$ are imposed on the coefficients. Notice that $\|\theta\|_1 \le \|\theta\|_q$ when $0 < q \le 1$, so the 
$\ell_1$-constraint is satisfied by default as long as $s \ge t_n$ and $\|\theta\|_q \le t_n$ with $0 < q \le 1$. Using similar 
arguments as used for T-strategies, the desired upper bounds can be easily derived. 



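□

Global metric entropy and local metric entropy. The tools developed in Yang and Barron [72] 
allow us to derive minimax lower bounds for $\ell_q$-aggregation of estimates or regression under 
$\ell_q$-constraints. Both global and local entropies of the regression function classes are relevant. The 
following lower bound result slightly generalizes Lemma 1 in [69]. 

Consider estimating a regression function $f_0$ in a general function class $\mathcal{F}$ based on i.i.d. 
observations $(X_i, Y_i)_{i=1}^n$ from the model 

$$Y = f_0(X) + \sigma \cdot \varepsilon, \tag{8.1}$$

where $\sigma > 0$ and $\varepsilon$ follows a standard normal distribution and is independent of $X$. 

Given $\mathcal{F}$, we say $G \subset \mathcal{F}$ is an $\epsilon$-packing set in $\mathcal{F}$ ($\epsilon > 0$) if any two functions in $G$ are more than 
$\epsilon$ apart in the $L_2$ distance. Let $0 < \alpha < 1$ be a constant. 

Definition 1: (Global metric entropy) The packing $\epsilon$-entropy of $\mathcal{F}$ is the logarithm of the size of the largest 
$\epsilon$-packing set in $\mathcal{F}$. The packing $\epsilon$-entropy of $\mathcal{F}$ is denoted by $M(\epsilon)$. 

Definition 2: (Local metric entropy) The $\alpha$-local $\epsilon$-entropy at $f \in \mathcal{F}$ is the logarithm of the size of the 
largest $(\alpha\epsilon)$-packing set in $B(f, \epsilon) = \{f' \in \mathcal{F} : \|f' - f\| \le \epsilon\}$. The $\alpha$-local $\epsilon$-entropy at $f$ is denoted 
by $M_\alpha(\epsilon \mid f)$. The $\alpha$-local $\epsilon$-entropy of $\mathcal{F}$ is defined as $M_\alpha^{\mathrm{loc}}(\epsilon) = \max_{f \in \mathcal{F}} M_\alpha(\epsilon \mid f)$. 

Suppose that $M_\alpha^{\mathrm{loc}}(\epsilon)$ is lower bounded by $\underline{M}_\alpha^{\mathrm{loc}}(\epsilon)$ (a continuous function), and assume that 
$M(\epsilon)$ is upper bounded by $\overline{M}(\epsilon)$ and lower bounded by $\underline{M}(\epsilon)$ (with $\overline{M}(\epsilon)$ and $\underline{M}(\epsilon)$ both being 
continuous). 

Suppose there exist $\epsilon_n$, $\overline{\epsilon}_n$, and $\underline{\epsilon}_n$ such that 

$$\underline{M}_\alpha^{\mathrm{loc}}(\sigma\epsilon_n) \ge n\epsilon_n^2 + 2\log 2, \tag{8.2}$$

$$\overline{M}(\sqrt{2}\,\sigma\overline{\epsilon}_n) = n\overline{\epsilon}_n^2, \tag{8.3}$$

$$\underline{M}(\sigma\underline{\epsilon}_n) = 4n\overline{\epsilon}_n^2 + 2\log 2. \tag{8.4}$$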
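Definitions 1 and 2 are phrased in terms of packing sets. As a rough illustration, the sketch below builds an $\epsilon$-packing greedily from a finite list of points in Euclidean space. This is an assumption-laden toy: a greedy packing is maximal but not necessarily the largest one, so the logarithm of its size only lower-bounds the packing entropy of the finite set. The name `greedy_packing` is hypothetical.

```python
import math


def greedy_packing(points, eps):
    """Greedily select points that are pairwise more than eps apart.

    Each point is a tuple of coordinates; distances are Euclidean.
    The result is a valid eps-packing, though not necessarily a largest one.
    """
    packing = []
    for p in points:
        if all(math.dist(p, q) > eps for q in packing):
            packing.append(p)
    return packing
```

Restricting `points` to a ball $B(f, \epsilon)$ and using the separation $\alpha\epsilon$ gives the corresponding local version.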
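Proposition 5. (Yang and Barron [72]) The minimax risk for estimating $f_0$ from model (8.1) in 
the function class $\mathcal{F}$ is lower-bounded as follows: 

$$\inf_{\hat{f}} \sup_{f_0 \in \mathcal{F}} E\|\hat{f} - f_0\|^2 \ge \frac{\alpha^2\sigma^2\epsilon_n^2}{8},$$

$$\inf_{\hat{f}} \sup_{f_0 \in \mathcal{F}} E\|\hat{f} - f_0\|^2 \ge \frac{\sigma^2\underline{\epsilon}_n^2}{8}.$$

Let $\underline{\mathcal{F}}$ be a subset of $\mathcal{F}$. If a packing set in $\mathcal{F}$ of size at least $\exp(\underline{M}_\alpha^{\mathrm{loc}}(\sigma\epsilon_n))$ or $\exp(\underline{M}(\sigma\underline{\epsilon}_n))$ is 
actually contained in $\underline{\mathcal{F}}$, then $\inf_{\hat{f}} \sup_{f_0 \in \underline{\mathcal{F}}} E\|\hat{f} - f_0\|^2$ is lower bounded by $\frac{\alpha^2\sigma^2\epsilon_n^2}{8}$ or $\frac{\sigma^2\underline{\epsilon}_n^2}{8}$, respectively. 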

Proof. The result is essentially given in [72], but not in the concrete forms. The second lower bound 
is given in [69]. We briefly derive the first one. 

Let $N$ be an $(\alpha\sigma\epsilon_n)$-packing set in $B(f, \sigma\epsilon_n) = \{f' \in \mathcal{F} : \|f' - f\| \le \sigma\epsilon_n\}$. Let $\Theta$ denote a uniform 
distribution on $N$. Then, the mutual information between $\Theta$ and the observations $(X_i, Y_i)_{i=1}^n$ is 
upper bounded by $\frac{n}{2}\epsilon_n^2$ (see Yang and Barron [72], Sections 7 and 3.2) and an application of Fano's 
inequality to the regression problem gives the minimax lower bound 

$$\frac{\alpha^2\sigma^2\epsilon_n^2}{4}\left(1 - \frac{I(\Theta; (X_i, Y_i)_{i=1}^n) + \log 2}{\log|N|}\right),$$

where $|N|$ denotes the size of $N$. By our way of defining $\epsilon_n$, the conclusion of the first lower bound 
follows. 

For the last statement, we prove the global entropy case; the argument for the local entropy 
case follows similarly. Observe that the upper bound on $I(\Theta; (X_i, Y_i)_{i=1}^n)$ by $\log(|G|) + n\overline{\epsilon}_n^2$, where 
$G$ is an $\overline{\epsilon}_n$-net of $\mathcal{F}$ under the square root of the Kullback-Leibler divergence (see [72], page 1571), 
continues to be an upper bound on $I(\Theta; (X_i, Y_i)_{i=1}^n)$, where $\Theta$ is the uniform distribution on a 
packing set in $\underline{\mathcal{F}}$. Therefore, by the derivation of Theorem 1 in [72], the same lower bound holds 
for $\underline{\mathcal{F}}$ as well. 

□ 
Proof of Theorem 3. 

Proof. Assume $f_0 \in \mathcal{F}$ in each case of $\mathcal{F}$ so that $d^2(f_0; \mathcal{F}) = 0$. Without loss of generality, assume 
$\sigma = 1$. 

(i) We first derive the lower bounds without $L_2$ or $L_\infty$ upper bound assumption on $f_0$. To prove 
case 1 (i.e., $\mathcal{F} = \mathcal{F}_q(t_n)$), it is enough to show that 



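$$\inf_{\hat{f}} \sup_{f_0 \in \mathcal{F}_q(t_n)} E\|\hat{f} - f_0\|^2 \ge C_q \begin{cases} \dfrac{M_n}{n} & \text{if } \tilde{m}^* = M_n, \\[2mm] t_n^q \left(\dfrac{1 + \log\frac{M_n}{(nt_n^2)^{q/2}}}{n}\right)^{1-q/2} & \text{if } 1 < \tilde{m}_* \le \tilde{m}^* < M_n, \\[2mm] t_n^2 & \text{if } \tilde{m}_* = 1, \end{cases}$$

in light of the fact that, by definition, when $\tilde{m}^* = M_n$, $\tilde{m}_* = M_n$, and when $1 < \tilde{m}_* \le \tilde{m}^* < M_n$, 
we have $\frac{\tilde{m}_*\left(1 + \log\frac{M_n}{\tilde{m}_*}\right)}{n}$ upper and lower bounded by multiples (depending only on $q$) of 
$t_n^q \left(\frac{1 + \log\frac{M_n}{(nt_n^2)^{q/2}}}{n}\right)^{1-q/2}$. Note that $\tilde{m}_*$ and $\tilde{m}^*$ are defined as $m_*$ and $m^*$ except that no ceiling of 
$n$ is imposed there. 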

Given that the basis functions are orthonormal, the $L_2$ distance on $\mathcal{F}_q(t_n)$ is the same as the $\ell_2$ 
distance on the coefficients in $B_q(t_n; M_n) = \{\theta : \|\theta\|_q \le t_n\}$. Thus, the entropy of $\mathcal{F}_q(t_n)$ under the 
$L_2$ distance is the same as that of $B_q(t_n; M_n)$ under the $\ell_2$ distance. 

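When $\tilde{m}^* = M_n$, we use the lower bound tool in terms of local metric entropy. Given the 
$\ell_q$-$\ell_2$ relationship $\|\theta\|_q \le M_n^{1/q - 1/2}\|\theta\|_2$ for $0 < q \le 2$, for $\epsilon \le \sqrt{M_n/n}$, taking $f_0^* = 0$, we have 

$$B(f_0^*; \epsilon) = \{f_\theta : \|f_\theta - f_0^*\| \le \epsilon, \|\theta\|_q \le t_n\} = \{f_\theta : \|\theta\|_2 \le \epsilon, \|\theta\|_q \le t_n\} = \{f_\theta : \|\theta\|_2 \le \epsilon\},$$

where the last equality holds because when $\epsilon \le \sqrt{M_n/n}$, for $\|\theta\|_2 \le \epsilon$, $\|\theta\|_q \le t_n$ is always satisfied. 

Consequently, for $\epsilon \le \sqrt{M_n/n}$, the $(\epsilon/2)$-packing of $B(f_0^*; \epsilon)$ under the $L_2$ distance is equivalent to 
the $(\epsilon/2)$-packing of $B_\epsilon = \{\theta : \|\theta\|_2 \le \epsilon\}$ under the $\ell_2$ distance. Note that the size of the maximum 
packing set is at least the ratio of volumes of the balls $B_\epsilon$ and $B_{\epsilon/2}$, which is $2^{M_n}$. Thus, the local 
entropy $M_{1/2}^{\mathrm{loc}}(\epsilon)$ of $\mathcal{F}_q(t_n)$ under the $L_2$ distance is at least $\underline{M}_{1/2}^{\mathrm{loc}}(\epsilon) = M_n \log 2$ for $\epsilon \le \sqrt{M_n/n}$. 
The minimax lower bound for the case of $\tilde{m}^* = M_n$ then directly follows from Proposition 5. 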
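The $\ell_q$-$\ell_2$ relationship invoked above can be checked on concrete vectors. The sketch below is illustrative only; `lq_norm` and `l2_ball_inside_lq_ball` are hypothetical helper names, and the second function encodes the sufficient condition $M_n^{1/q-1/2}\sqrt{M_n/n} \le t_n$ under which every $\theta$ with $\|\theta\|_2 \le \sqrt{M_n/n}$ automatically satisfies $\|\theta\|_q \le t_n$.

```python
import random


def lq_norm(theta, q):
    """The l_q (quasi-)norm (sum_j |theta_j|^q)^(1/q) for q > 0."""
    return sum(abs(t) ** q for t in theta) ** (1.0 / q)


def l2_ball_inside_lq_ball(M, n, t, q):
    """Sufficient check that {||theta||_2 <= sqrt(M/n)} lies in {||theta||_q <= t}.

    Uses ||theta||_q <= M^(1/q - 1/2) * ||theta||_2, valid for 0 < q <= 2.
    """
    return M ** (1.0 / q - 0.5) * (M / n) ** 0.5 <= t
```

For example, with $M_n = 4$, $q = 1/2$ and $t_n = 1$, the check succeeds for $n = 400$ and fails for $n = 100$.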

When $1 < \tilde{m}_* \le \tilde{m}^* < M_n$, the use of global entropy is handy. Applying the minimax lower 
bound in terms of global entropy in Proposition 5, with the metric entropy order for larger $\epsilon$ (which 
is tight in our case of orthonormal functions in the dictionary) from Theorem 1, the minimax lower 
rate is readily obtained. Indeed, for the class $\mathcal{F}_q(t_n)$, with $\epsilon > t_n M_n^{1/2 - 1/q}$, there are constants $c'$ and 
$c''$ (depending only on $q$) such that 

$$c'\,(t_n\epsilon^{-1})^{\frac{2q}{2-q}} \log(1 + M_n^{\frac{1}{q}-\frac{1}{2}} t_n^{-1}\epsilon) \le \underline{M}(\epsilon) \le \overline{M}(\epsilon) \le c''\,(t_n\epsilon^{-1})^{\frac{2q}{2-q}} \log(1 + M_n^{\frac{1}{q}-\frac{1}{2}} t_n^{-1}\epsilon).$$

Thus, we see that $\underline{\epsilon}_n$ determined by (8.4) is lower bounded by $c\, t_n^{q/2}\left(\left(1 + \log\frac{M_n}{(nt_n^2)^{q/2}}\right)/n\right)^{\frac{2-q}{4}}$, 
where $c$ is a constant depending only on $q$. 

When $\tilde{m}_* = 1$, note that with $f_0^* = 0$ and $\epsilon \le t_n$, 

$$B(f_0^*; \epsilon) = \{f_\theta : \|\theta\|_2 \le \epsilon, \|\theta\|_q \le t_n\} \supset \{f_\theta : \|\theta\|_q \le \epsilon\}.$$

Observe that the $(\epsilon/2)$-packing of $\{f_\theta : \|\theta\|_q \le \epsilon\}$ under the $L_2$ distance is equivalent to the 
$(1/2)$-packing of $\{f_\theta : \|\theta\|_q \le 1\}$ under the same distance. Thus, by applying Theorem 1 with 
$t_n = 1$ and $\epsilon = 1/2$, we know that the $(\epsilon/2)$-packing entropy of $B(f_0^*; \epsilon)$ is lower bounded by 
$c''\log(1 + \frac{1}{2}M_n^{1/q - 1/2})$ for some constant $c''$ depending only on $q$, which is at least a multiple of $nt_n^2$ 
when $\tilde{m}^* \le \left(1 + \log\frac{M_n}{\tilde{m}^*}\right)^{q/2}$. Therefore we can choose $0 < \delta < 1$ small enough (depending only on 
$q$) such that 

$$c''\log(1 + \tfrac{1}{2}M_n^{1/q - 1/2}) \ge n\delta^2 t_n^2 + 2\log 2.$$

The conclusion then follows from applying the first lower bound of Proposition 5. 

To prove case 2 (i.e., $\mathcal{F} = \mathcal{F}_0(k_n)$), noticing that for $M_n/2 \le k_n \le M_n$, we have $\frac{1 + \log 2}{2}M_n \le 
k_n\left(1 + \log\frac{M_n}{k_n}\right) \le M_n$, together with the monotonicity of the minimax risk in the function class, 
it suffices to show the lower bound for $k_n \le M_n/2$. Let $B_{k_n}(\epsilon) = \{\theta : \|\theta\|_2 \le \epsilon, \|\theta\|_0 \le k_n\}$. As in 
case 1, we only need to understand the local entropy of the set $B_{k_n}(\epsilon)$ for the critical $\epsilon$ that gives 
the claimed lower rate. Let $\eta = \epsilon/\sqrt{k_n}$. Then $B_{k_n}(\epsilon)$ contains the set $D_{k_n}(\eta)$, where 

$$D_k(\eta) = \{\theta = \eta I : I \in \{1, 0, -1\}^{M_n}, \|I\|_0 \le k\}.$$

Clearly $\|\eta I_1 - \eta I_2\|_2 \ge \eta\,(d_{HM}(I_1, I_2))^{1/2}$, where $d_{HM}(I_1, I_2)$ is the Hamming distance between 
$I_1, I_2 \in \{1, 0, -1\}^{M_n}$. From Lemma 4 of [53] (the result there actually also holds when requiring 
the pairwise Hamming distance to be strictly larger than $k/2$; see also the derivation of a metric 
entropy lower bound in [45]), there exists a subset of $\{I : I \in \{1, 0, -1\}^{M_n}, \|I\|_0 \le k\}$ with more than 
$\exp\left(\frac{k}{2}\log\frac{2(M_n - k)}{k}\right)$ points that have pairwise Hamming distance larger than $k/2$. Consequently, 
we know the local entropy of $\mathcal{F}_0(k_n)$ is lower bounded by $\frac{k_n}{2}\log\frac{2(M_n - k_n)}{k_n}$. The result 
follows. 
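The role of the Hamming distance in case 2 can be seen on a toy instance. The sketch below enumerates sign vectors with exactly $k$ nonzero entries (a subset of the set with $\|I\|_0 \le k$ used in the text) and extracts a greedy subset with pairwise Hamming distance larger than $k/2$. It is an illustration under these simplifying assumptions, not the construction of Lemma 4 in [53]; all function names are hypothetical.

```python
import math
from itertools import combinations, product


def sparse_sign_vectors(M, k):
    """All I in {1, 0, -1}^M with exactly k nonzero entries."""
    vectors = []
    for support in combinations(range(M), k):
        for signs in product((1, -1), repeat=k):
            v = [0] * M
            for j, s in zip(support, signs):
                v[j] = s
            vectors.append(tuple(v))
    return vectors


def hamming(a, b):
    """Number of coordinates in which a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_hamming_packing(vectors, min_dist):
    """Greedy subset whose pairwise Hamming distances all exceed min_dist."""
    chosen = []
    for v in vectors:
        if all(hamming(v, u) > min_dist for u in chosen):
            chosen.append(v)
    return chosen
```

Because two distinct entries of such vectors differ by at least 1 in absolute value, the $\ell_2$ distance between $\eta I_1$ and $\eta I_2$ is at least $\eta\sqrt{d_{HM}(I_1, I_2)}$, so a Hamming-separated subset is also separated in $\ell_2$.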

To prove case 3 (i.e., $\mathcal{F} = \mathcal{F}_q(t_n) \cap \mathcal{F}_0(k_n)$), for the larger $k_n$ case, from the proof of case 1, we 
have used fewer than $k_n$ nonzero components to derive the minimax lower bound there. Thus, 
the extra $\ell_0$-constraint does not change the problem in terms of lower bound. For the smaller $k_n$ 
case, note that for $\theta$ with $\|\theta\|_0 \le k_n$, 

$$\|\theta\|_q \le k_n^{1/q - 1/2}\|\theta\|_2 \le k_n^{1/q - 1/2}\sqrt{Ck_n\left(1 + \log\frac{M_n}{k_n}\right)/n}$$

for $\theta$ with $\|\theta\|_2 \le \sqrt{Ck_n\left(1 + \log\frac{M_n}{k_n}\right)/n}$ for some constant $C > 0$. Therefore the $\ell_q$-constraint is 
automatically satisfied when $\|\theta\|_2$ is no larger than the critical order $\sqrt{k_n\left(1 + \log\frac{M_n}{k_n}\right)/n}$, which 
is sufficient for the lower bound via local entropy techniques. The conclusion follows. 

(ii) Now, we turn to the lower bounds under the $L_2$ norm condition. When the regression function 
$f_0$ satisfies the boundedness condition in $L_2$ norm, the estimation risk is obviously upper bounded 
by $L^2$ by taking the trivial estimator $\hat{f} = 0$. In all of the lower boundings in (i) through local 
entropy argument, if the critical radius $\epsilon$ is of order 1 or lower, the extra condition $\|f_0\| \le L$ 
does not affect the validity of the lower bound. Otherwise, we take $\epsilon$ to be $L$. Then, since the 
local entropy stays the same, it directly follows from the first lower bound in Proposition 5 that 
$L^2$ is a lower order of the minimax risk. The only remaining case is that of $\left(1 + \log\frac{M_n}{m^*}\right)^{q/2} < 
m^* < M_n$. If $t_n^q\left(\left(1 + \log\frac{M_n}{(nt_n^2)^{q/2}}\right)/n\right)^{1-q/2}$ is upper bounded by a constant, from the proof of the 
lower bound of the metric entropy of the $\ell_q$-ball in [45], we know that the functions in the special 
packing set satisfy the $L_2$ bound. Indeed, consider $\{f_\theta : \theta \in D_{m_n}(\eta)\}$ with $m_n$ being a multiple of 
$\left(nt_n^2/\left(1 + \log\frac{M_n}{(nt_n^2)^{q/2}}\right)\right)^{q/2}$ and $\eta$ being a (small enough) multiple of $\sqrt{\left(1 + \log\frac{M_n}{(nt_n^2)^{q/2}}\right)/n}$. Then 
these $f_\theta$ have $\|f_\theta\|^2$ upper bounded by a multiple of $t_n^q\left(\left(1 + \log\frac{M_n}{(nt_n^2)^{q/2}}\right)/n\right)^{1-q/2}$ and the minimax 
lower bound follows from the last statement of Proposition 5. If $t_n^q\left(\left(1 + \log\frac{M_n}{(nt_n^2)^{q/2}}\right)/n\right)^{1-q/2}$ is 
not upper bounded, we reduce the packing radius to $L$ (i.e., choose $\eta$ so that $\eta\sqrt{m_n}$ is bounded 
by a multiple of $L$). Then the functions in the packing set satisfy the $L_2$ bound and furthermore, 
the number of points in the packing set is of a larger order than $nt_n^q\left(\left(1 + \log\frac{M_n}{(nt_n^2)^{q/2}}\right)/n\right)^{1-q/2}$. 
Again, adding the $L_2$ condition on $f_0 \in \mathcal{F}_q(t_n)$ does not increase the mutual information bound 
in our application of Fano's inequality. We conclude that the minimax risk is lower bounded by a 
constant. 

(iii) Finally, we prove the lower bounds under the sup-norm bound condition. For 1), under 
the direct sup-norm assumption, the lower bound is obvious. For the general $M_n$ case 2), note 
that the functions $f_\theta$ in the critical packing set satisfy $\|\theta\|_2 \le \epsilon$ with $\epsilon$ being a multiple 
of $\sqrt{k_n\left(1 + \log\frac{M_n}{k_n}\right)/n}$. Then together with $\|\theta\|_0 \le k_n$, we have $\|\theta\|_1 \le \sqrt{k_n}\|\theta\|_2$, which is bounded 
by assumption. The lower bound conclusion then follows from the last part of Proposition 5. To 
prove the results for the case $M_n/\left(1 + \log\frac{M_n}{k_n}\right) \le bn$, as in [58], we consider the special dictionary 
$F_n = \{f_i : 1 \le i \le M_n\}$ on $[0, 1]$, where 

$$f_i(x) = \sqrt{M_n}\, I_{\left[\frac{i-1}{M_n}, \frac{i}{M_n}\right)}(x), \quad i = 1, \ldots, M_n.$$

Clearly, these functions are orthonormal. By the last statement of Proposition 5, we only need 
to verify that the functions in the critical packing set in each case do have the sup-norm bound 
condition satisfied. Note that for any $f_\theta$ with $\theta \in D_{k_n}(\eta)$ (as defined earlier), we have $\|f_\theta\| \le \eta\sqrt{k_n}$ 
and $\|f_\theta\|_\infty \le \eta\sqrt{M_n}$. Thus, it suffices to show that the critical packing sets for the previous lower 
bounds without the sup-norm bound can be chosen with $\theta$ in $D_{k_n}(\eta)$ for some $\eta = O(M_n^{-1/2})$. 

Consider $\eta$ to be a (small enough) multiple of $\sqrt{\left(1 + \log\frac{M_n}{k_n}\right)/n} = O(M_n^{-1/2})$ (which holds under 
the assumption $\frac{M_n}{1 + \log(M_n/k_n)} \le bn$). From the proof of part (ii) without constraint, we know that there 
is a subset of $D_{k_n}(\eta)$ with more than $\exp\left(\frac{k_n}{2}\log\frac{2(M_n - k_n)}{k_n}\right)$ points that are separated in $\ell_2$ 
distance by at least a multiple of $\sqrt{k_n\left(1 + \log\frac{M_n}{k_n}\right)/n}$. 
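The two properties of the special dictionary used in part (iii), orthonormality and the sup-norm bound $\|f_\theta\|_\infty \le \eta\sqrt{M_n}$, can be verified on a grid. The sketch below assumes the dictionary consists of scaled indicators of the intervals $[(i-1)/M_n, i/M_n)$ and approximates $L_2$ inner products on $[0,1]$ by midpoint Riemann sums; the function names are hypothetical.

```python
import math


def dictionary_value(i, x, M):
    """f_i(x) = sqrt(M) on [(i-1)/M, i/M) and 0 elsewhere, for i = 1, ..., M."""
    lower, upper = (i - 1) / M, i / M
    return math.sqrt(M) if lower <= x < upper else 0.0


def inner_product(i, j, M, grid=200):
    """Midpoint Riemann sum of f_i * f_j over [0, 1]."""
    n_pts = M * grid
    total = 0.0
    for t in range(n_pts):
        x = (t + 0.5) / n_pts
        total += dictionary_value(i, x, M) * dictionary_value(j, x, M)
    return total / n_pts


def f_theta(theta, x):
    """Linear combination sum_i theta_i * f_i(x) with M = len(theta)."""
    M = len(theta)
    return sum(theta[i - 1] * dictionary_value(i, x, M) for i in range(1, M + 1))
```

Since the supports are disjoint, at any point $x$ only one dictionary element is active, which is why a coefficient vector with entries in $\{\eta, 0, -\eta\}$ yields a function bounded by $\eta\sqrt{M_n}$.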

□ 



Proof of Theorem 4- 

Proof. For linear regression with random design, we assume the true regression function $f_0$ belongs 
to $\mathcal{F}_q^L(t_n; M_n)$, or $\mathcal{F}_0^L(k_n; M_n)$, or both, thus $d^2(f_0; \mathcal{F})$ is equal to zero for all cases (except for 
AC-strategies when $\mathcal{F} = \mathcal{F}_0^L(k_n; M_n)$, which we discuss later). 

(i) For T-strategies and $\mathcal{F} = \mathcal{F}_q^L(t_n; M_n)$. For each $1 \le m \le M_n \wedge n$, according to the general 
oracle inequalities in Theorem 12, the adaptive estimator has 

p.? r . ^ "^^o 1 + log (*!") + log(M. An) - log(l - po) \ 

sup i?(/A;/o;n) < Co 2c2 h 2c3 ^-'^ 

\ n n I 



Aco ( ||/oir-2c3 



2 r,„ lOgPO 



When $m_* = m^* = M_n < n$, the full model $J_{M_n}$ results in an upper bound of order $M_n/n$. 
When $m_* = m^* = n < M_n$, we choose the null model and the upper bound is simply of order 



When 1 < TO* < m* < M„An, the similar argument of Theorem 2 leads to an upper bound of or- 
der lA^ (l + log ^) . Since (nt^ )<?/2 (i + bg j^^Y'^^ < to, < 4(ni2 )9/2 (l + log ^^^^) 

then the upper bound is further upper bounded by Cqtl- — for some constant 

Cg only depending on q. 

When $m_* = 1$, the null model leads to an upper bound of order $\|f_0\|^2 + \frac{1}{n} \le t_n^2 + \frac{1}{n} \le 2\left(t_n^2 \vee \frac{1}{n}\right)$ 
as $f_0 \in \mathcal{F}_q^L(t_n; M_n)$. 

For $\mathcal{F} = \mathcal{F}_0^L(k_n; M_n)$ or $\mathcal{F} = \mathcal{F}_q^L(t_n; M_n) \cap \mathcal{F}_0^L(k_n; M_n)$, one can use the same argument as in 
Theorem 2. 

(ii) For AC-strategies, for $\mathcal{F} = \mathcal{F}_q^L(t_n; M_n)$ or $\mathcal{F} = \mathcal{F}_q^L(t_n; M_n) \cap \mathcal{F}_0^L(k_n; M_n)$, again one can use 
the same argument as in the proof of Theorem 2. For $\mathcal{F} = \mathcal{F}_0^L(k_n; M_n)$, the approximation error is 

infs>i (inf{0:||e||i<s,||e||o<fc„,||/ei|=e<i} Wfo - /o|P + Zcsi^^iiM^ < inf{e:||0||i<a„,||e||o<fc„,||/8||oc<i} \\fe~ 
fof + 2c3i2l(i±£i) = 2c3^2s(i±£:i) jf ^ J-jf (fc„; Af„). The upper bound then follows. 

□ 



Proof of Theorem 5. 
Proof. Without loss of generality, we assume $\sigma^2 = 1$ for the error variance. First, we give a simple 
fact. Let $B_k(\eta) = \{\theta : \|\theta\|_2 \le \eta, \|\theta\|_0 \le k\}$ and $B_k(f_0; \epsilon) = \{f_\theta : \|f_\theta\| \le \epsilon, \|\theta\|_0 \le k\}$ (take $f_0 = 0$). 
Then, under Assumption SRC with $\gamma = k$, the $\frac{\underline{a}}{2\overline{a}}$-local $\epsilon$-packing entropy of $B_k(f_0; \epsilon)$ is lower 
bounded by the $\frac{1}{2}$-local $\eta$-packing entropy of $B_k(\eta)$ with $\eta = \epsilon/\overline{a}$. 

(i) The proof is essentially the same as that of Theorem 3. When $m^* = M_n$, the previous 
lower bounding method works with a slight modification. When $\left(1 + \log\frac{M_n}{m^*}\right)^{q/2} < m^* < M_n$, 
we again use the global entropy to derive the lower bound based on Proposition 5. The key is 
to realize that in the derivation of the metric entropy lower bound for $\{\theta : \|\theta\|_q \le t_n\}$ in [45], an 
optimal size packing set is constructed in which every member has at most $m_*$ non-zero coefficients. 
Assumption SRC with $\gamma = m_*$ ensures that the $L_2$ distance on this packing set is equivalent to the 
$\ell_2$ distance on the coefficients and then we know the metric entropy of $\mathcal{F}_q(t_n; M_n)$ under the $L_2$ 
distance is at the order given. The result follows as before. When $m^* \le \left(1 + \log\frac{M_n}{m^*}\right)^{q/2}$, observe 
that $\mathcal{F}_q(t_n; M_n) \supset \{\beta x_j : |\beta| \le t_n\}$ for any $1 \le j \le M_n$. The use of the local entropy result in 
Proposition 5 readily gives the desired result. 

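(ii) As in the proof of Theorem 3, without loss of generality, we can assume $k_n \le M_n/2$. Together 
with the simple fact given at the beginning of the proof, for $B_{k_n}(\epsilon/\overline{a}) = \{\theta : \|\theta\|_2 \le \epsilon/\overline{a}, \|\theta\|_0 \le k_n\}$, 
with $\eta' = \epsilon/(\overline{a}\sqrt{k_n})$, we know $B_{k_n}(\epsilon/\overline{a})$ contains the set 

$$\{\theta = \eta' I : I \in \{1, 0, -1\}^{M_n}, \|I\|_0 \le k_n\}.$$

For $\theta_1 = \eta' I_1$, $\theta_2 = \eta' I_2$ both in the above set, by Assumption SRC, $\|f_{\theta_1} - f_{\theta_2}\|^2 \ge \underline{a}^2\eta'^2 d_{HM}(I_1, I_2) \ge 
\underline{a}^2\epsilon^2/(2\overline{a}^2)$ when the Hamming distance $d_{HM}(I_1, I_2)$ is larger than $k_n/2$. With the derivation in the 
proof of part (i) of Theorem 3 (case 2), we know the local entropy $M^{\mathrm{loc}}_{\underline{a}/(\sqrt{2}\,\overline{a})}(\epsilon)$ of $\mathcal{F}_0(k_n; M_n) \cap 
\{f_\theta : \|\theta\|_2 \le a_n\}$ with $a_n \ge \epsilon$ is lower bounded by $\frac{k_n}{2}\log\frac{2(M_n - k_n)}{k_n}$. Then, under the condition 
$a_n \ge C\sqrt{k_n\left(1 + \log\frac{M_n}{k_n}\right)/n}$ for some constant $C$, the minimax lower rate $k_n\left(1 + \log\frac{M_n}{k_n}\right)/n$ 
follows from a slight modification of the proof of Theorem 3 with $\epsilon = C'\sqrt{k_n\left(1 + \log\frac{M_n}{k_n}\right)/n}$ for 
some constant $C' > 0$. When $0 < a_n < C\sqrt{k_n\left(1 + \log\frac{M_n}{k_n}\right)/n}$, with $\epsilon$ of order $a_n$, the lower bound 
follows. 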

(iii) For the larger $k_n$ case, from the proof of part (i) of the theorem, we have used fewer than $k_n$ 
nonzero components to derive the minimax lower bound there. Thus, the extra $\ell_0$-constraint does not 
change the problem in terms of lower bound. For the smaller $k_n$ case, note that for $\theta$ with $\|\theta\|_0 \le k_n$, 
$\|\theta\|_q \le k_n^{1/q - 1/2}\|\theta\|_2 \le k_n^{1/q - 1/2}\sqrt{Ck_n\left(1 + \log\frac{M_n}{k_n}\right)/n}$ for $\theta$ with $\|\theta\|_2 \le \sqrt{Ck_n\left(1 + \log\frac{M_n}{k_n}\right)/n}$. 
Therefore the $\ell_q$-constraint is automatically satisfied when $\|\theta\|_2$ is no larger than the critical order 
$\sqrt{k_n\left(1 + \log\frac{M_n}{k_n}\right)/n}$, which is sufficient for the lower bound via local entropy techniques. The 
conclusion follows. 

□ 



Proof of Theorem 6. 

Proof. (i) We only need to derive the lower bound part. Under the assumptions that $\sup_j \|X_j\|_\infty \le 
L_0 < \infty$ for some constant $L_0 > 0$, for a fixed $t_n = t > 0$, we have $\forall f_\theta \in \mathcal{F}_q(t_n; M_n)$, $\|f_\theta\|_\infty \le 
\sup_j \|X_j\|_\infty \cdot \sum_{j=1}^{M_n} |\theta_j| \le L_0\|\theta\|_1 \le L_0\|\theta\|_q \le L_0 t$. Then the conclusion follows directly from 
Theorem 5 (Part (i)). Note that when $t_n$ is fixed, the case $m_* = 1$ need not be separately 
considered. 

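(ii) For the upper rate part, we use the AC-C upper bound. For $f_\theta$ with $\|\theta\|_\infty \le L_0$, clearly, 
we have $\|\theta\|_1 \le M_n L_0$, and consequently, since $\log(1 + M_n L_0)$ is upper bounded by a multiple 
of $k_n\left(1 + \log\frac{M_n}{k_n}\right)$, the upper rate $\frac{k_n}{n}\left(1 + \log\frac{M_n}{k_n}\right) \wedge 1$ is obtained from Theorem 4. Under the 
assumptions that $\sup_j \|X_j\|_\infty \le L_0 < \infty$ and $k_n\sqrt{\left(1 + \log\frac{M_n}{k_n}\right)/n} \le \sqrt{K_0}$, we know that $\forall f_\theta \in 
\mathcal{F}_0(k_n; M_n) \cap \{f_\theta : \|\theta\|_2 \le a_n\}$ with $a_n = C\sqrt{k_n\left(1 + \log\frac{M_n}{k_n}\right)/n}$ for some constant $C > 0$, the 
sup-norm of $f_\theta$ is upper bounded by 

$$\Big\|\sum_{j=1}^{M_n} \theta_j X_j\Big\|_\infty \le L_0\|\theta\|_1 \le L_0\sqrt{k_n}\,a_n = CL_0 k_n\sqrt{\frac{1 + \log\frac{M_n}{k_n}}{n}} \le C\sqrt{K_0}\,L_0.$$

Then the functions in $\mathcal{F}_0(k_n; M_n) \cap \{f_\theta : \|\theta\|_2 \le a_n\}$ have sup-norm uniformly bounded. Note that 
for bounded $a_n$, $\|\theta\|_2 \le a_n$ implies that $\|\theta\|_\infty \le a_n$. Thus, the extra restriction $\|\theta\|_\infty \le L_0$ does 
not affect the minimax lower rate established in part (ii) of Theorem 5. 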

(iii) The upper and lower rates follow similarly from Theorems 4 and 5. The details are thus 
skipped. 



□ 



Now we turn to the setup in Section 5 with $\sigma^2$ known. 
Proposition 6. (Yang [65], Theorem 1) When $\lambda \ge 5.1\log 2$, we have 



E{ASE{fj))<BM i\\fj-fo'L 



|2 



Jer„ n n 

where $B > 0$ is a constant that depends only on $\lambda$. 

Proof of Theorem 7. 

Proof. The general case (ii) is easily derived based on our estimation procedure and Proposition 6. 

To prove (i), when $\mathcal{F} = \mathcal{F}_q(t_n; M_n)$, according to the upper bound in (ii) and Theorem 1, when 
$f_0 \in \mathcal{F}_q(t_n; M_n)$, for any $1 \le m \le (M_n - 1) \wedge n$, there exists a subset $J_m$ and $f_{\theta^m} \in \mathcal{F}_{J_m}$ such that 

msEifj)) 

' ' ,2 , fr'O™ , a2log(Af„An) a'^\o^{^^^^)\ a'^ruA 



n 



Ai3(^(^||/jo-/?ll„ + -j Aa^^ 



AS I I + — I A CT^ 

n 



Since $\log\binom{M_n}{m} \le m\left(1 + \log\frac{M_n}{m}\right)$ and $\log M_n \le m\left(1 + \log\frac{M_n}{m}\right)$, then for models with size $1 \le m \le 
(M_n - 1) \wedge n$, we have 



rt n n 



where B' only depends on q and A. 
2 

When = m* ~ A/„ A n, the fuU model Ja/„ leads to an upper bound of order . When 

1 < TO* < 771* < M„ A n, we get the desired upper bounds by evaluating the risk bounds choosing 
Jm, and Jun- When to* = 1. models Jq and Jj\/^ result in the desired upper bound. 

The arguments for cases J- = J^o(fcn; -^^n) and T = Tq{tn] M„) D J-'olkn] M^) are similar to those 
of Theorem 2 and above with rj^ replacing to in the upper bounds. 

□ 

Proof of Theorem 8. 

Proof. Without loss of generality, assume the error variance $\sigma^2 = 1$. Let $P_f(y^n) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}}\exp\left(
-\frac{1}{2}(y_i - f(x_i))^2\right)$ denote the joint density of $Y^n = (Y_1, \ldots, Y_n)'$, where the components are 
independent with mean $f(x_i)$ and variance 1, $1 \le i \le n$. Then the Kullback-Leibler distance between 
$P_{f_1}(y^n)$ and $P_{f_2}(y^n)$ is 

$$D(P_{f_1} \| P_{f_2}) = \frac{1}{2}\sum_{i=1}^n \left(f_1(x_i) - f_2(x_i)\right)^2.$$

To prove the lower bounds, instead of the global $L_2$ distance on the regression functions, we need 
to work with the distance $d(f_1, f_2) = \sqrt{\sum_{i=1}^n (f_1(x_i) - f_2(x_i))^2}$. 
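The closed form for the Kullback-Leibler distance between two Gaussian product densities with unit variance can be confirmed by numerical integration. The sketch below integrates $p_1 \log(p_1/p_2)$ coordinate by coordinate with a midpoint rule and compares the total with $\frac{1}{2}\sum_i (f_1(x_i) - f_2(x_i))^2$. The function names, integration window and step count are illustrative assumptions.

```python
import math


def kl_gaussian_1d(mu1, mu2, half_width=12.0, steps=24000):
    """Midpoint-rule integral of p1 * log(p1 / p2) for N(mu1, 1) and N(mu2, 1)."""
    lo = mu1 - half_width
    h = 2.0 * half_width / steps
    log_norm = -0.5 * math.log(2.0 * math.pi)
    total = 0.0
    for s in range(steps):
        y = lo + (s + 0.5) * h
        log_p1 = -0.5 * (y - mu1) ** 2 + log_norm
        log_p2 = -0.5 * (y - mu2) ** 2 + log_norm
        total += math.exp(log_p1) * (log_p1 - log_p2) * h
    return total


def kl_closed_form(f1_values, f2_values):
    """(1/2) * sum_i (f1(x_i) - f2(x_i))^2 for unit-variance Gaussian errors."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(f1_values, f2_values))
```

Because the coordinates are independent, the divergence of the joint densities is the sum of the coordinatewise divergences, which is what the comparison relies on.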

First consider the case $\mathcal{F} = \mathcal{F}_q(t_n; M_n)$. Let $B_k(\eta) = \{\theta : \|\theta\|_2 \le \eta, \|\theta\|_0 \le k\}$ and $B_k(f_0; \epsilon) = 
\{f_\theta : \|f_\theta\|_n \le \epsilon, \|\theta\|_0 \le k\}$ ($f_0 = 0$). Then, under Assumption SRC' with $\gamma = k$, the $\frac{\underline{a}}{2\overline{a}}$-local 
$\epsilon$-packing entropy of $B_k(f_0; \epsilon)$ is lower bounded by the $\frac{1}{2}$-local $\eta$-packing entropy of $B_k(\eta)$ with 
$\eta = \epsilon/\overline{a}$. When $\gamma = m_*$, the proof is the same as the proof of Theorem 5. 

Now consider the case $\mathcal{F} = \mathcal{F}_0(k_n; M_n)$ and again assume $k_n \le M_n/2$ as in the proof of Theorem 
5. When Assumption SRC holds with $\gamma = k_n$, the lower bound is of order $\frac{k_n(1 + \log(M_n/k_n))}{n}$ as before 
in the random design case. The proof for the last case $\mathcal{F} = \mathcal{F}_q(t_n; M_n) \cap \mathcal{F}_0(k_n; M_n)$ is similarly 
done as in the proof of Theorem 5. 

□ 
Proof of Theorem 9. 

Proof. According to Corollary 6 from [46], we have 

EiASEU-^s^) < M (ll/, - fnl + ^ + 

which is basically the same as Proposition 6 with B = 1. Thus, the rest of the proof is basically the 
same as that of Theorem 7. 

□ 

To prove Theorem 10, we need an oracle inequality, which improves Theorem 4 of [65], where 
only a convergence in probability result is given. Suppose that only the subset models $J_m$ with rank 
$r_{J_m} \le n/2$ are considered (which is automatically satisfied when $M_n \le n/2$). Let $\Gamma$ denote these 
models. (More generally, a risk bound similar to the following holds if we consider models with size 
no more than $(1 - \rho)n$ for any small $\rho > 0$.) Let $C_J$ be the descriptive complexity of the model $J$ 
in $\Gamma$. 

Proposition 7. When A > 40 log 2, the selected model J' by ABC satisfies 

EiASEifj,)) < B inf (||/, - f^^ + ^ + ^) , 
where B is a constant that depends on A, ct^, and cP' . 

Remark 19. If we add models with rank $r_J > n/2$ into the competition, as long as the complexity 
assignment over all the models is valid (i.e., satisfying the summability condition), if we can show 
that for these added models, ABC(J) are also upper and lower bounded with high probabilities as 
in (8.5) and (8.6), then the risk bound in the proposition continues to hold. 

Proof. Let e„ = (ei, . . . ,£„)'. For ease in writing, we simplify to ||-||^ in this proof. From page 
495 in [65], for each candidate model J, we have 

ABC [J) = \\Ajf^f + rj (^^^ {\\yn-Y.j\\'' + \o''Cj)-nA + Aa^Cj + 2remi( J) + rem2( J), 
where ||A.7/o"||2 = 117, - /^^||2, renii(J) = e'^^fS - MjfS) and reni2(J) = rj- e'„Mjen. Note also 
that ||y„ - + Aa^Cj = WAjfX + (" " rj)a^ + {e'^Aje^ - (n - T.j)a^) + 2e;,Aj/„ + Aa^Cj. 
Let 

T(J) = II A//o"ll' + {n- rj)a^ + Xa^Cj, and ni?„(J) = Pj/o"f + rja^ + Xa^Cj. 

As is shown in the proof of Theorem 1, [65], if A > /i(ti,T2) = max(sup^>Q((2(log 2)^)^/2/ti — 
supp>o(p/'^2 — 1)2 (log 2)/ (p — log(p + 1))) for some constants ti and T2 with 2ti + T2 < 1, then 
for any (5 > 0, with probability no less than 1 — 5(5, |rcmi(J)| < Ti(ni?„(J) + gi{6)), |rcm2(J)| < 
T2{nRn{J)+92{S)), and \e',,Ajen-{n~rj)a^\ < T2{T{J)+g2{6)), where = 92(6) = Alog2(l/(5). 
Then with probability no less than 1 — 5(5, we have 

V n~r,j ) 

-2T^{nR^{J) + gi{5)) - T2{nR„{J) + g2{5)) + Xa^Cj 

. II, rnu2^ f 2{l-{2n+T2))T{J) 2(2rigiffl+r2.g2((5)) 2 

> \\AjJo II + r/ cr 

\ n — rj n — rj 

-(2ti + r2)ni?„(J) - (2ri.gi(,5) + r2ff2('5)) + Xa^Cj 

> \\A.,fS\? + 0(1 - (4n + 2r2))a^ - ^'-^^^^^^^^'^ + ^^^^^'^^ 

n — rj 

-(2ri + T2)nR.n{J) - (2Ti.gi(,5) + T2g2{5)) + Xa'^Cj 

> (1 - (6ti + 3r2))nii'„(J) -{2ngi{6) + T2g2{S)). (8.5) 

n — rj 

Suppose $6\tau_1 + 3\tau_2 < 1$. Let $J_n$ be the candidate model that minimizes $R_n(J)$. Then with exception 
probability less than $5\delta$, we have 



(T^ + (2ti +r2)ni?„(J„) 



" + '■''-(2n5i(<5) + r252(^)) + Xw'Cj^ 



n - rj^ 

Since T{Jn)/{n - rjj = (1 + rjj{n - rjj)i?„( J„) + (1 - rjj{n - rjj)a^ < 2R„{Jn) + (t^, then 

ABC'(J„) < (5 + 14ri + 7T2)7ii?„(J„) + '^{2Tigi{5) + T2,g2((5)). (8.6) 

Thus, for any $\delta > 0$, when the sample size is large enough, we have that with probability no less 
than $1 - 5\delta$, 

ABC'{.r) + Ii^(2rigi(<S) + r^g^{5)) 



nRn[J') < 



1 - (6ri + 3t2) 
ABC'iJn) + ^(2rigi(5) + ra^aW) 

- 1 - (6ti +3r2) 

(5 + 14ti + 7T2)ni?„(J„) + !l±ri^(2Ti5i(<5) + r252((5)) + '^i2ng,{S) + T2g2{S)) 

- 1 - (6ti + 3t2) ■ 

Thus, with probability at least $1 - 5\delta$, 



(^ + ^) (2rm(<5)+r2.g2W) 



1 - (6ri + 3t2) (1 - (6ri + 3T2))ni?„( J„) 



^ 5 + 14ti + 7t2 



(^ + ^) (2r,g,(<5)+r2.g2W) 



< 



l-(6ri+3T2) (1-(6ti+3t2))(t2 
5 + 14ti + 7r2 6(2rigi((5) + T2g2('?)) 
l-(6ri+3T2) (1 - (6ti +3T2))fj2 ' 



Let 



^i?„(J„) 1 - (6ti + 3r2) J " (1 - (6ti + 3t2)W 

Then $P(W \ge -\log_2(\delta)) \le 5\delta$ for $0 < \delta < 1$. Since $E(W^+) = \int_0^\infty P(W \ge t)\,dt \le 5\int_0^\infty 2^{-t}\,dt =$ 
5/ln2 and i?„(J„) < (aVa^) inf jgr -R„(/o; J) where i?„(/o; J) = ||7j -/o"ll' +0^7" + Aa^Cj/n, 
then we have 



infjgr^n(/o;-'^) Rn{Jn) Mjfzr R„{fo;J) 

5 + 14ri + 7r2 30(2ti + T2)A \ fa^ 



',1 - (6ri +3t2) ' (ln2)(l-(6Ti + 3T2))crV 
So E{ASE{fj,)) < Bmij^rRn{.fo',J), where the constant B depends on ti, T2, ct, and a. Mini- 
mizing /i(ti, T2) over ti > and T2 > in the region 6ti + 3r2 < 1, one finds a minimum value less 
than 40 log 2. Thus, the results of the theorem hold when A > 40 log 2. 
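The tail-integration step in the proof above, which turns the exponential tail bound $P(W \ge t) \le 5 \cdot 2^{-t}$ into $E(W^+) \le 5/\ln 2$, can be confirmed numerically. The following is only a quick check with a midpoint rule on a truncated range; the function name and the truncation point are illustrative choices.

```python
import math


def tail_integral(upper=100.0, steps=100000):
    """Midpoint-rule approximation of the integral of 5 * 2**(-t) over [0, upper]."""
    h = upper / steps
    return sum(5.0 * 2.0 ** (-(s + 0.5) * h) for s in range(steps)) * h
```

The truncation at `upper` discards a remainder of order $2^{-100}$, so the computed value agrees with $5/\ln 2$ to within the discretization error.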

□ 



Proposition 7 may not provide optimal risk rate when $r_{M_n}$ is small, or when $r_{M_n}$ is larger than $n/2$ 
(in which case the risk bound on $E(ASE(\hat{f}_{\hat{J}'}))$ can be arbitrarily large because the approximation 
errors can be arbitrarily large when the models are restricted to be of size $n/2$ or smaller). The issue 
can be resolved by considering the full model $J_{M_n}$ and the full projection model $\bar{J}$ in the candidate 
model list, as described before Theorem 10. 

Proof of Theorem 10 : 

Proof. Observe that for the full projection model J, with the chosen Cj, we have that 

(1 - (6ti + 3T2))nR„{J) < ABC' (J) < CnR,,{J) = ^ {na^ + Xa^Cj) 

for some constant ^ > that depends only on A, and a^. From the remark after Proposition 7, 
we have the following risk bounds for the three situations. Below B and B' are constants depending 
only on A, a^, and a^. 

1. When M„ < n/2, we have the general risk bound 
E{ASE{fj,)) 

< B'(^ inf f||/..-/o1i: + ^ + ^^)A(||/.,.„-/o"||%^ 
ARn{J) A RMo)) 



J,„:l<m<M„ V " n 



n I n I \\ n 

For $f_0^n \in \mathcal{F}_q(t_n; M_n)$, from above, by an argument similar to that in Theorem 7, for any 
$1 \le m \le M_n$, there exists a subset $J_m$ and $f_{\theta^m} \in \mathcal{F}_{J_m}$ such that 

EiASEif,)) < f f + ^ + ^Mili^^") A ^1 

AB'((4 + ^)Aa2). (8.7) 

2 

When m* = m* = Af„, the full model Jm„ leads to an upper bound of order When 
1 < < M„, we get the desired upper bound by taking the smaller value of the index of 
resolvability at J,„_^ and Jj\/„. When m.^ = 1, the smaller value of the index of resolvability at 
Jo and Jm^ results in the given upper bound. 
The arguments for cases $\mathcal{F} = \mathcal{F}_0(k_n; M_n)$ and $\mathcal{F} = \mathcal{F}_q(t_n; M_n) \cap \mathcal{F}_0(k_n; M_n)$ are similar to 
those of Theorem 7. 
2. When $M_n > n/2$ and $r_{M_n} \ge n/2$, evaluating the index of resolvability gives 

E{ASE{fj,)) 

< b( inf , + Ai?4J)Ai?„(Jo) 



./„,:l<m<n/2 



2^- aHog[n/2\ , a'\og{'^) 



< B' inf 

J™:l<m<n/2 \ 71 

AS'((||/,7o-/fll?. + ^) Aa^ 

In this case, for the full model, clearly, we have Wfj^,^ — foWn + > jcr^, which cannot be 

better than the model J up to a constant factor. We next show that adding the models with size 
n/2 < m < Mn does not help either in terms of the rate in the risk bound. If rj^^ > rj\/^/2, 
then obviously \\fj^ - f^^\\l + ^ + + > i^'- For rj„^ < rM„/2, 

if 7i/2 < m < M„/2, then there exists a smaller model with size m < n/2 that has the 
same approximation error and rank, but smaller complexity Cj~ (i.e., Cj^ < Cj^), where 
Cj^ = log(n A M„) + log (^^") when m > n/2. If m > A/„/2 (and rj,^ < rM„/2), then due to 
the monotonicity of the function (^^") in m > A/„/2, since there must be more than ri^i^/2 
terms left out in the model, we must have log (^^") > log (^^ /2j) — L''Ji/,i/2j log y^.^l"/2\ ' 

which is at least of order $n$ under the condition $r_{M_n} \ge n/2$. Putting the above facts together, 
we conclude that adding the models with size $n/2 < m \le M_n$ does not affect the validity of 
the risk bound given in part (ii) of Theorem 10 (note that $\log\lfloor n/2 \rfloor$ is of the same order as 
$\log(M_n \wedge n)$ in our case). Then, the general risk upper bound becomes (with $B'$ enlarged by 
an absolute constant factor) 

B' M f||/,„,-/„-.||; + ;!r£.-,''MM„An)^,'logm\ 



J„:l<m<A/„ 



AS' Who - /^l^ + - A A B' - ml + 




n J J \ n 



,/m:l<m<J\f„ Y " n 

AB'((!l/7„-/oll?. + ^)Aa2 

For $f_0 \in \mathcal{F}_q(t_n; M_n)$ and any $1 \le m \le M_n$, there exists a subset $J_m$ and $f_{\theta^m} \in \mathcal{F}_{J_m}$ such
that the inequality (8.7) holds. When $m_* = m^* = M_n \wedge n$, the full projection model $\bar J$ leads
to an upper bound of order $\sigma^2$. When $1 < m_* < M_n \wedge n$, we get the desired upper bounds
by choosing $J_{m_*}$ and $\bar J$ to evaluate the index of resolvability. When $m_* = 1$, models $J_0$ and $\bar J$
result in the desired upper bound.
3. When $M_n > n/2$ and $r_{M_n} < n/2$, the full model is already included, and, similarly as above,
the models with $n/2 < m < M_n$ can be included in the minimization set of the general risk
bound. Indeed, if $r_{M_n} = 1$, the statement is trivial. If $r_{J_m} \ge r_{M_n}/2$, then
$\|f_{J_m} - f_0\|_n^2 + \frac{\sigma^2 r_{J_m}}{n} + \frac{\sigma^2\log\lfloor n/2\rfloor}{n} + \frac{\sigma^2\log\binom{M_n}{m}}{n} \ge \|f_{J_{M_n}} - f_0\|_n^2 + \frac{\sigma^2 r_{M_n}}{2n}$,
which means that the model $J_m$ cannot beat
the full model up to a constant factor. For $r_{J_m} < r_{M_n}/2$, if $m > M_n/2$, then we again have
$\log\binom{M_n}{m} \ge \log\binom{M_n}{\lfloor r_{M_n}/2\rfloor} \ge \lfloor r_{M_n}/2\rfloor \log\frac{M_n}{\lfloor r_{M_n}/2\rfloor}$. Thus there exists a model in $\Gamma_n$ with the
same rank of $r_{J_m} < n/2$ and approximation error, and its complexity is at most at the same
order as $J_m$. Then with the same arguments for the case of $r_{M_n} \ge n/2$, we again conclude
that adding the models with size $n/2 < m < M_n$ does not affect the validity of the risk bound
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given in part (ii) of Theorem 10. Thus, the general risk bound is 
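$$
\begin{aligned}
E(ASE(\hat f_{\hat J}))
&\le B\Big(\|f_{J_{M_n}}-f_0\|_n^2+\Big[\inf_{J_m:\,1\le m\le n/2}\Big(\|f_{J_m}-f_{J_{M_n}}\|_n^2+\frac{\sigma^2 r_{J_m}}{n}+\frac{\sigma^2\log\lfloor n/2\rfloor}{n}+\frac{\sigma^2\log\binom{M_n}{m}}{n}\Big)\wedge\frac{\sigma^2 r_{M_n}}{n}\Big]\Big)
\wedge B\Big(\|f_{J_0}-f_0\|_n^2+\frac{\sigma^2}{n}\Big)\\
&\le B'\Big(\|f_{J_{M_n}}-f_0\|_n^2+\Big[\inf_{J_m:\,1\le m<M_n}\Big(\|f_{J_m}-f_{J_{M_n}}\|_n^2+\frac{\sigma^2 r_{J_m}}{n}+\frac{\sigma^2\log(M_n\wedge n)}{n}+\frac{\sigma^2\log\binom{M_n}{m}}{n}\Big)\wedge\frac{\sigma^2 r_{M_n}}{n}\Big]\Big)
\wedge B'\Big(\|f_{J_0}-f_0\|_n^2+\frac{\sigma^2}{n}\Big).
\end{aligned}
$$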



For $f_0 \in \mathcal{F}_q(t_n; M_n)$ and any $1 \le m \le M_n$, there exists a subset $J_m$ and $f_{\theta^m} \in \mathcal{F}_{J_m}$ such that
the inequality (8.7) holds. When $m_* = m^* = M_n \wedge n$, the full model $J_{M_n}$ leads to an upper
bound of order $\frac{\sigma^2 r_{M_n}}{n}$. When $1 < m_* < M_n \wedge n$, we get the desired upper bounds by choosing
$J_{m_*}$ and $J_{M_n}$ when evaluating the index of resolvability. When $m_* = 1$, taking models $J_0$ and
$J_{M_n}$ results in the desired upper bound.

□ 
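The counting step used in cases 2 and 3 (a model of size $m > M_n/2$ that leaves out at least $\lfloor r_{M_n}/2\rfloor$ terms has a binomial complexity term bounded below by $\lfloor r_{M_n}/2\rfloor \log(M_n/\lfloor r_{M_n}/2\rfloor)$) can be sanity-checked numerically. The sketch below is illustrative only and is not part of the proof: the function name `check_counting_bound`, its arguments `M` (standing for $M_n$) and `r` (standing for $r_{M_n}$), and the tolerance are assumptions introduced here.

```python
from math import comb, log


def check_counting_bound(M: int, r: int, tol: float = 1e-9) -> bool:
    """Check log C(M, m) >= log C(M, k) >= k * log(M / k), with k = floor(r / 2),
    for every size m with M/2 < m <= M - k (i.e. at least k terms left out).

    M and r are hypothetical stand-ins for M_n and r_{M_n}; we need 2 <= r <= M.
    """
    if not (2 <= r <= M):
        raise ValueError("need 2 <= r <= M")
    k = r // 2
    anchor = log(comb(M, k))    # log C(M, floor(r/2))
    lower = k * log(M / k)      # uses C(M, k) >= (M / k)^k
    if anchor < lower - tol:
        return False
    # C(M, m) = C(M, M - m), and C(M, j) is nondecreasing in j for j <= M/2,
    # so leaving out j = M - m >= k terms (with j < M/2) cannot decrease it.
    for m in range(M // 2 + 1, M - k + 1):
        if log(comb(M, m)) < anchor - tol:
            return False
    return True
```

Calling `check_counting_bound(40, 20)` or `check_counting_bound(101, 60)` exercises both inequalities over all admissible sizes $m$; a `True` result is only a numerical illustration of the monotonicity argument, not a substitute for it.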



Proof of Theorem 11: 

Proof. The proof is similar to that of Theorem 10 except that we use the oracle inequality (4.7) in
[7] instead of that in Proposition 7 (and there is no need to consider the different scenarios). Note
that if $M_n \le (n - 7) \wedge \varsigma n$, then $m \vee \log\binom{M_n}{m} \le \varsigma n$ for all $1 \le m \le M_n$. Thus all subset models are
allowed by the BGH criterion. When $M_n$ is larger, however, the conditions required in Corollary 1
of [7] may invalidate the choice of $m_*$ or $k_n$ when it is too large, hence the upper bound assumption
on $m_*$ and $k_n$. We skip the details of the proof.

□ 
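The admissibility claim for small $M_n$ in the proof above follows from a crude count of subsets. A short derivation, offered as a sketch: here $c\,n$ denotes the multiple of $n$ that bounds $M_n$ in the stated condition, and the symbol $c$ is notation introduced only for this illustration.

```latex
% Assumption: M_n \le c n, where c is a positive constant (notation introduced here).
\[
  \log\binom{M_n}{m}
  \;\le\; \log\sum_{j=0}^{M_n}\binom{M_n}{j}
  \;=\; M_n\log 2
  \;\le\; M_n
  \;\le\; c\,n,
  \qquad
  m \;\le\; M_n \;\le\; c\,n ,
\]
\[
  \text{hence}\qquad
  m \vee \log\binom{M_n}{m} \;\le\; c\,n
  \qquad\text{for all } 1 \le m \le M_n .
\]
```

The first chain uses only that a single binomial coefficient is at most the sum of all of them, $2^{M_n}$, and that $\log 2 < 1$.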
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